1. Introduction
Small targets offer few usable features for underwater detection and recognition. During downsampling in the feature extraction network, these features tend to vanish, which seriously degrades the recognition accuracy of small underwater targets. Existing algorithms usually extract multi-scale features of the target to counter this feature disappearance. However, this approach significantly increases the computational effort of the algorithm [Reference Lu, Li, Ding and Guo1]. Dilated convolution inserts holes into the convolution kernel, which enlarges the receptive field while preserving the resolution of the feature map and without increasing the computational effort. However, dilated convolution suffers from the “gridding” problem, which can cause a partial loss of adjacent information [Reference Wang, Chen, Yuan, Liu, Huang, Hou and Cotrell2].
The underwater environment is poorly lit and the water is murky. This reduces the clarity of the underwater image and prevents the autonomous underwater vehicle (AUV) from obtaining complete information about the target features in the acquired images [Reference Sun and Cai3]. The graph convolutional neural network (CNN) can effectively learn the spatial semantic features of the targets by capturing the inter-target dependencies through information transfer between nodes. The lack of features in the target can be compensated by spatial semantic features, improving the recognition accuracy of underwater blurred images [Reference Cai, Chen and Chai4]. A suitable correlation matrix is the key to extracting spatial semantic features accurately. However, existing algorithms usually construct correlation matrices from label co-occurrence relationships, which generalize poorly.
To address the difficulties of underwater small target recognition, we propose an enhanced dilated convolution framework for underwater blurred target recognition, as shown in Fig. 1. The method enables the recognition of small targets in the presence of blurred images.
The main contributions of the methodology in this paper are as follows:
1. The proposed method improves the feature extraction network. The introduction of hybrid dilated convolution with different expansion rates improves the resolution of small target features. The method effectively solves the problem of small target feature disappearance without increasing the computational effort.
2. The algorithm proposed in this paper constructs an adaptive correlation matrix through two $ 1\times 1$ convolutional layers and a dot product operation and learns the spatial semantic relations of the targets through this matrix. The algorithm solves the problem of incomplete target feature information in underwater blurred images.
3. The proposed algorithm fuses the visual features and spatial semantic relations of the target and trains the network with a focal loss function. The algorithm effectively improves the recognition accuracy of small underwater blurred targets.
2. Related work
Rapid recognition of underwater targets is a key issue for autonomous recognition by AUVs. Small underwater targets tend to be missed during recognition. Jian et al. [Reference Jian, Qi, Dong, Yin and Lam5] present a new framework for underwater image saliency detection. The algorithm combines a quaternion number system with principal component analysis to achieve superior performance. Kong et al. [Reference Kong, Hong, Jia, Yao, Cong, Hu and Zhang6] proposed an efficient feature extraction method, which effectively improves the real-time performance of the algorithm. For small targets in complex environments that are easily masked by other objects or noise, Wu et al. [Reference Wu, An, Chen, Qian and Sun7] proposed an open-closed transformation algorithm to eliminate or weaken the background and noise. By suppressing noise, this algorithm extracts the weakened features and recognizes small targets, which effectively improves the recognition efficiency for small targets. A three-stage FCA algorithm for HR is presented in [Reference Gongor and Tutsoy8] and is used to extract face features. In terms of CNNs, Li et al. [Reference Li, Zhang, Xiang and Pan9] extracted features from high-resolution range profiles (HRRP) and classified targets to achieve the detection of small targets. This recognition method has good generalization capability and stable performance. Cao et al. [Reference Cao, Hou, Gulliver and Lan10] proposed a wavelet neural network (WNN) to detect small low-altitude targets. The algorithm can detect multiple small targets at the same time. Wu et al. [Reference Shuang-Chen and Zheng-Rong11] proposed a new deep convolutional network for small targets in infrared images. The problem of small target detection is transformed into the classification of the small target position distribution, and excellent results are obtained in different scenarios.
In response to the fact that targets and backgrounds differ in some areas, He et al. [Reference He, Zhang, Mu, Yan, Wang and Chen12] proposed a multi-scale local gray dynamic range (MLGDR) method, which achieves a high signal-to-noise ratio and low detection rate in different scenes. A new multi-view algorithm that combines information from multiple images of a single target object is reported in [Reference Kannappan and Tanner13]. This algorithm is used for the binary classification of underwater images. Deng et al. [Reference Deng, Sun and Zhou14] embed multi-scale fuzzy metric detection in complex backgrounds, and the algorithm eliminates a large amount of background clutter and noise. Cheng et al. [Reference Cheng, Jiang, H.Li and Huang15] present a method based on illumination and affine invariant feature extraction that improves the speed and accuracy of visual target recognition for space robots and reduces the effect of light and occlusion on target recognition. Li et al. [Reference Li, Zhang, Peng and Dong16] proposed a network framework (DMNet) incorporating dilated convolution and multi-scale mechanisms. The algorithm extracts image contextual information using the multi-scale mechanism and extracts small detail features using dilated convolution, which effectively improves the recognition performance of the algorithm. Wang et al. [Reference Wang, Hu, Wang, Chen and Pan17] used a single static image density estimation method for CNNs. A multi-scale dilated convolutional module was used to integrate the underlying detailed information into high-level semantic features to enhance the recognition capability of the network. The algorithm has excellent robustness. Fang et al. [Reference Jian, Liu, Luo, Lu, Yu and Dong18] construct a multi-scale feature pyramid fusion neural network based on dilated convolution, and the algorithm achieves faster recognition and tracking of targets.
Experiments show that the algorithm has good convergence speed and generalization ability. These methods have effectively solved the problem of feature disappearance and improved the recognition ability for small targets. However, the recognition capability needs to be improved for targets with incomplete features.
Images captured in underwater environments often exhibit complex lighting and severe water turbidity [Reference Jian, Qi, Yu, Dong, Cui, Nie, Zhang, Yin and Lam19, Reference Fang and Liu20]. Complex terrain also obscures the image. These factors make it difficult for AUVs to acquire image information. In solving the occlusion problem, Shen et al. [Reference Shen, Zhao, Fan, Lian, Zhang, Kreidieh and Liu21] used graph neural networks to mine graph node relationships and CNNs to construct body part maps. The algorithm implements target detection of obscured pedestrians. Wei et al. [Reference Gama, Isufi, Leus and Ribeiro22] used graph signal processing (GSP) to characterize the representation space of a graph neural network (GNN), giving the GNN better observability. Fu et al. [Reference Fu, Fu, Wang, Dong and Ren23] designed guided graph CNNs with a new residual shunt structure to investigate the relationship between skeletal data and human actions. Lu et al. [Reference Lu, Chen, Zhao, Liu, Lai and Chen24] proposed a new model for converting semantic segmentation into graph nodes. The model extends the receptive field and combines structure with feature extraction without losing location information. The approach validates the idea of combining graph structure with deep learning. Zhang et al. [Reference Zhang, Jin, Sun, Wang and Sangaiah25] extracted spatial and semantic convolutional features using CNNs to keep the spatial features at a high resolution, thus effectively improving the accuracy of visual tracking. A hybrid network model based on a CNN and a long short-term memory (LSTM) model is constructed in [Reference Zhang and Zhang26]. To mitigate the data sparsity problem, the paper [Reference Tian, Kang, Xing, Li, Zhao, Fan and Zhang27] combines object proposals with attentional networks to efficiently capture salient objects and human attention regions in dynamic video scenes.
Their proposed framework performs better than existing deep models on saliency detection databases. Tian et al. [Reference Li, Qiu, Chen, Mei, Hong and Tao28] designed a new contrast loss function. The SGEN architecture was used to train the contrast loss for spatial and semantic similarity. The algorithm effectively improves the object detection performance. Li et al. [Reference Yin and Hu29] proposed a new end-to-end semantic segmentation network that integrates lightweight spatial and channel attention modules, which adaptively refine the features. The experiments show that the algorithm achieves better semantic segmentation results. Yin et al. [Reference Wang, Lan, Zhang and Luo30] designed an enhanced global attention decoder (EGAUD) that replies to detailed semantic information and makes predictions by enhancing the feature aggregation module for attention and semantic segmentation. A model of gated spaces and semantic attention headings is proposed in the literature [Reference Jian, Wang, Yu and Wang31], and experiments show that the algorithm is effective in terms of quantitative and qualitative results. The above methods have performed well in some tests but fall short in the recognition of small underwater blurred targets.
3. Proposed method
This paper extracts small underwater target features through a hybrid dilated convolution network, enlarging the algorithm’s receptive field without increasing its computational effort. The missing features of the underwater target are compensated by learning the spatial semantic features of the target through an adaptive correlation matrix. Finally, the proposed algorithm fuses spatial semantic features and visual features to recognize small blurred underwater targets.
3.1. Network model
The underwater environment has problems such as low light and turbid water. These phenomena result in the loss of small target features underwater. The micro-target feature extraction network uses an optimized ResNet as the base network [Reference Avelin and Nyström32, Reference Zhang, Chen, Wu, Cai, Lu and Li33]. The network input is a $ 256\times 256$ three-channel image. Thirteen $ 3\times 3$ convolution kernels with a stride of 2 are used to convolve the input image, producing a 13-channel feature map. In parallel, max pooling is applied to the input image, which effectively preserves the original information of the image and speeds up the training. The output of the max pooling is a three-channel feature map. The two results are concatenated to obtain a 16-channel feature map, as shown in Fig. 2.
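The channel and size bookkeeping of this initial block can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' implementation: the padding of 1 for the convolution and the $ 2\times 2$ pooling window with stride 2 are assumptions (the text does not state them), and the function names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial size after a convolution or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def initial_block_shape(height=256, width=256, in_channels=3, conv_kernels=13):
    """Return (channels, height, width) after fusing the two branches."""
    # Branch 1: 3x3 convolution with stride 2; padding=1 is an assumption.
    conv_h = conv_output_size(height, kernel=3, stride=2, padding=1)
    conv_w = conv_output_size(width, kernel=3, stride=2, padding=1)
    # Branch 2: max pooling; the 2x2 window with stride 2 is an assumption.
    pool_h = conv_output_size(height, kernel=2, stride=2, padding=0)
    pool_w = conv_output_size(width, kernel=2, stride=2, padding=0)
    if (conv_h, conv_w) != (pool_h, pool_w):
        raise ValueError("branches must agree spatially before concatenation")
    # Channel-wise concatenation: 13 convolution maps + 3 pooled input channels.
    return (conv_kernels + in_channels, conv_h, conv_w)


print(initial_block_shape())
```

Under these assumptions both branches halve the spatial size, so the 13 convolution channels and the 3 pooled channels can be concatenated directly into the 16-channel map.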
In order to maintain the resolution and enlarge the receptive field of the network, “holes” are inserted into the convolution kernel. This is called dilated convolution. For a $ k\times k$ convolution kernel and expansion rate $ r$ , an expansion filter of size $ k_d\times k_d$ is obtained, where $ k_d=k+(k-1)\cdot \left (r-1\right )$ . The expansion module can obtain a larger field of view with fewer network layers, effectively speeding up the training while keeping the feature maps of the output layer at the same resolution as the input layer. In this paper, we use hybrid dilated convolution to extract image features to solve the problem of incomplete local information and irrelevant information. The expansion rates of the convolution kernels are set to 1, 2, and 5, as shown in Fig. 3. The method improves the resolution of small target features.
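The effect of the expansion rates can be illustrated in one dimension. The sketch below, with hypothetical helper names and assuming an odd kernel size, evaluates $ k_d$ for each rate and checks whether a stack of dilated convolutions reaches every input position inside its receptive field; positions that are skipped correspond to the “gridding” problem.

```python
def effective_kernel_size(k, r):
    """Size k_d of a k x k kernel dilated with expansion rate r."""
    return k + (k - 1) * (r - 1)


def covered_offsets(k, rates):
    """1-D input offsets reachable by stacking dilated convolutions (odd k)."""
    half = k // 2
    offsets = {0}
    for r in rates:
        # Each layer lets every reachable offset move by multiples of r.
        offsets = {o + r * t for o in offsets for t in range(-half, half + 1)}
    return offsets


def has_gridding(k, rates):
    """True if some position inside the receptive field is never sampled."""
    offsets = covered_offsets(k, rates)
    return offsets != set(range(min(offsets), max(offsets) + 1))


for rate in (1, 2, 5):
    print(rate, effective_kernel_size(3, rate))
print(has_gridding(3, [1, 2, 5]), has_gridding(3, [2, 2, 2]))
```

With $ k=3$ , the rates 1, 2, and 5 give effective sizes 3, 5, and 11 and cover every position of the stacked receptive field, whereas repeating a single rate of 2 only ever samples even offsets.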
The convolution layer is followed by batch normalization and exponential linear units (ELUs) [Reference Yang, Wei, Tu, Zeng, Kinsy, Zheng and Ren34]. The ELU activation function speeds up learning and mitigates vanishing gradients. The activation function is as follows:
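As a minimal sketch, the standard ELU definition returns $ x$ for $ x>0$ and $ \alpha \left (e^x-1\right )$ otherwise. The scale parameter $ \alpha$ and its default value of 1 are assumptions here, since the text does not state them.

```python
import math


def elu(x, alpha=1.0):
    """Standard exponential linear unit; alpha=1.0 is an assumed default."""
    # Positive inputs pass through unchanged; negative inputs saturate
    # smoothly towards -alpha instead of being clipped to zero.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


print([round(elu(v), 4) for v in (-2.0, -0.5, 0.0, 1.5)])
```

Because negative inputs keep a nonzero slope, gradients can still flow for units with negative pre-activations, which is the property the text refers to.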
The image $ x$ is the input to the feature extraction network $ Q$ , and $ Q$ outputs the features $ G=Q(x)$ .
3.2. Semantic space feature extraction
Underwater targets are blurred by turbidity and low light levels, resulting in a lack of information on target features. The correlation matrix represents the spatial semantic relationships between different targets. Existing correlation matrices are usually constructed from the label co-occurrence relations of the training set, and their generalization ability is weak. This paper designs adaptive correlation matrices to represent the semantic correlation between targets, as shown in Fig. 4. The adaptive correlation matrix module consists of two $ 1\times 1$ convolutional layers and a dot product operation [Reference Li, Peng, Qiao and Peng35]. The learned label correlation matrix $ M$ output by the module is as follows:
where $ W_\phi$ and $ W_\theta$ denote the convolution kernels, $ \ast$ is the convolution operation, $ D$ is the label word embedding vector, and $ C$ is the number of categories. As some rare co-occurrence relations may be noise, a probability threshold $\tau$ is set to filter the noise in this paper. The filtered matrix is as follows:
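One way to picture this module is sketched below. It is an illustration under stated assumptions rather than the paper's exact formulation: a $ 1\times 1$ convolution is treated as the same linear map applied to every label embedding, the dot products are squashed with a sigmoid so that entries can be read as probabilities, and entries below $\tau$ are set to 0 while the rest are set to 1. The function names, the sigmoid, and the binarizing filter are all hypothetical choices.

```python
import math


def linear_map(weights, vectors):
    """Apply a 1x1 convolution, i.e. one shared linear map, to each vector."""
    return [[sum(w * x for w, x in zip(row, v)) for row in weights]
            for v in vectors]


def correlation_matrix(embeddings, w_a, w_b):
    """C x C matrix of sigmoid-squashed dot products (sigmoid is assumed)."""
    a = linear_map(w_a, embeddings)
    b = linear_map(w_b, embeddings)
    return [[1.0 / (1.0 + math.exp(-sum(p * q for p, q in zip(u, v))))
             for v in b] for u in a]


def filter_matrix(matrix, tau):
    """Assumed filter: keep an edge (1.0) only if its entry reaches tau."""
    return [[1.0 if m >= tau else 0.0 for m in row] for row in matrix]


# Toy example: C = 3 labels with 2-dimensional embeddings.
D = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_A = [[1.0, 0.0], [0.0, 1.0]]
W_B = [[1.0, 0.0], [0.0, -1.0]]
M = correlation_matrix(D, W_A, W_B)
print(filter_matrix(M, tau=0.5))
```

Because the matrix is computed from learnable kernels rather than from co-occurrence counts in the training set, it can adapt during training, which is the motivation given in the text.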
In this paper, a spatial semantic feature extraction network is constructed. The network uses an adaptive correlation matrix to represent the semantic correlation between targets and updates the feature representation through information transfer between nodes. The spatial semantic feature extraction network can be represented as:
The spatial semantic features are initialized as $ f^l=G$ , and $ f^{l+1}$ denotes the updated spatial semantic features. $ B$ is the normalized adaptive correlation matrix. $ W^l$ is the transformation matrix to be learned. $ l\left (\bullet \right )$ is the nonlinear Leaky ReLU activation function.
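With the symbols defined above, one propagation step can be sketched as $ f^{l+1}=l\left (Bf^lW^l\right )$ , the usual graph convolution form. The sketch below assumes this form, uses row normalization to obtain $ B$ , and uses a Leaky ReLU slope of 0.2; the normalization scheme, the slope, and the function names are assumptions.

```python
def normalize_rows(matrix):
    """Assumed normalization: divide each row by its sum."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def leaky_relu(x, slope=0.2):
    """Leaky ReLU; the slope of 0.2 is an assumed value."""
    return x if x > 0 else slope * x


def gcn_layer(features, correlation, weight, slope=0.2):
    """One update f_next = LeakyReLU(B @ f @ W) over all nodes."""
    b = normalize_rows(correlation)
    mixed = matmul(matmul(b, features), weight)
    return [[leaky_relu(v, slope) for v in row] for row in mixed]


# Two nodes with two features each; identity weights keep the example readable.
f_next = gcn_layer([[1.0, -4.0], [3.0, 2.0]],
                   [[1.0, 1.0], [0.0, 1.0]],
                   [[1.0, 0.0], [0.0, 1.0]])
print(f_next)
```

In the toy example the first node averages its own features with those of the second node, which is how information from neighboring targets can compensate for features that are missing in a blurred region.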
3.3. Enhanced dilated convolution framework for underwater blurred target recognition
After the visual features have been acquired, the target is recognized. First, the visual features and the spatial semantic features are fused. Due to the blurring of the underwater image, small targets cannot be effectively recognized by visual features alone. The graph neural network is used to capture the information of surrounding nodes, establish the connection relations between nodes, and extract the spatial semantic features of small targets. The extracted spatial semantic features are fused with the visual features, which effectively improves the recognition accuracy. Anchors are generated for the fused feature nodes, with h anchors set at each point. The softmax function is used to classify the anchor boxes and extract the positive anchors, and bounding box regression is applied to the positive anchors. The results are fed into the proposal layer to calculate the exact proposals. The cls layer classifies the proposals. The reg layer regresses the proposals again to obtain the target anchor boxes.
On the basis of the acquired visual feature maps, this paper extracts the candidate regions of the target. The algorithm in this paper fuses spatial semantic features and visual features to achieve recognition of small targets. The candidate frame is considered as a node in the graph structure. The spatial semantic features $ f^{l+1}$ and visual features $ G$ of the nodes are fused, and the target type is predicted based on the fused results. The fused features are expressed as:
where $ F_P$ is the feature fusion output function. $ F_P$ maps the spatial semantic features $ f^{l+1}$ and the feature set $ G$ into the feature vector $ P_c$ . $ P_c$ contains two types of information about the target: spatial semantic information and visual features. In this paper, $ P_c$ is fed into the fully connected layer to predict the target category scores.
Target classification is performed by the cls layer. The final output is a $ C+1$ dimensional array $ Y$ , calculated by the softmax function. $ Y$ denotes the category confidence level of the target as:
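A minimal sketch of the softmax step follows, assuming the $ C+1$ scores from the fully connected layer are given as a plain list (the extra entry is taken here to be a background class, which is an assumption). Subtracting the maximum score is a standard numerical-stability measure and does not change the result.

```python
import math


def softmax(scores):
    """Convert C+1 raw category scores into confidences that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for C = 6 target types plus one background entry.
confidences = softmax([2.0, 0.5, 0.1, -1.0, 0.3, 0.0, -0.5])
print(confidences.index(max(confidences)), round(sum(confidences), 6))
```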
The model is trained by minimizing a loss function. The loss function includes a regression loss and a classification loss, as shown in Eq. (7):
where $ c$ is the target category. $ R$ denotes the Smooth L2 function. $ i$ denotes the index of the candidate box. $ y_i^c$ denotes the category confidence level of anchor box $ i$ . $ T_i$ denotes the coordinates of the target anchor box, given by the regression layer. $ T_i^\ast$ denotes the coordinates of the real target area. $N_{\textrm{reg}}$ is equal to the number of anchor boxes. $N_{\textrm{cls}}$ denotes the minimum training batch size. $N_{\textrm{cls}}$ and $N_{\textrm{reg}}$ normalize the two terms of the loss function. $ \sigma \left (\bullet \right )$ is the sigmoid function. $ \lambda$ denotes the balancing weight.
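The structure of such a loss can be sketched as a normalized classification term plus a $ \lambda$ -weighted, normalized regression term. The sketch below is an illustration under assumptions and not a reproduction of Eq. (7): the classification term uses the commonly cited sigmoid focal loss with focusing parameter 2 and class weight 0.25 (both assumed values), and the regression terms are passed in as numbers already produced by $ R$ . All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def focal_loss(logit, label, gamma=2.0, alpha=0.25):
    """Sigmoid focal loss for one anchor box and one category.

    gamma and alpha are assumed values, not taken from the paper.
    """
    p = sigmoid(logit)
    p_t = p if label == 1 else 1.0 - p           # probability of the true label
    a_t = alpha if label == 1 else 1.0 - alpha   # class-balancing weight
    # (1 - p_t) ** gamma down-weights examples that are already easy.
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)


def total_loss(cls_terms, reg_terms, n_cls, n_reg, lam=1.0):
    """Normalized classification loss plus lam times normalized regression loss."""
    return sum(cls_terms) / n_cls + lam * sum(reg_terms) / n_reg


easy = focal_loss(4.0, 1)   # confident, correct prediction
hard = focal_loss(0.0, 1)   # uncertain prediction
print(easy < hard, total_loss([easy, hard], [0.1, 0.3], n_cls=2, n_reg=2))
```

The focusing term is what makes this kind of loss attractive for small targets: well-classified boxes contribute little, so the hard, blurred examples dominate the gradient.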
4. Experiment
The images used in this experiment are taken from the Underwater Target dataset (UTD), the Cognitive Autonomous Diving Buddy (CADDY) underwater dataset, and the Underwater Image Enhancement Benchmark (UIEB) dataset. The images in the dataset contain frogmen, submarines, torpedoes, and AUV types. The dataset has 11,560 labeled images with a ratio of 7:3 between the training and test sets. The training set was used to train the extraction model, and the test set was used to test the recognition network. Training and testing were carried out in TensorFlow under Win10. The simulations were run on a small server with a GTX 2080 GPU and 64 GB of RAM.
This paper proposes an enhanced dilated convolution framework for underwater blurred target recognition. Facing the problem of blurred underwater images, we designed two sets of target recognition simulation experiments in this paper. The two sets of simulation experiments were conventional underwater images and blurred images, and the compared algorithms were CRSNet [Reference Li, Zhang and Chen36], DMNet [Reference Jiang, Lyu, Liu, He and Hao37], Improved RetinaNet [Reference Tian, Zheng and Jin38], and MobileNet-SSD [Reference Hu, Li, Li and Wang39]. The algorithms were evaluated in terms of recognition accuracy (mAP) and recognition time.
4.1. Clear underwater image recognition results
Figure 5 shows the target recognition results for clear underwater images. In Fig. 5, the rows indicate the recognition accuracy of the same algorithm for different targets. The six target types are torpedo, torpedo wake, submarine, frogman, bubble, and AUV. Table I shows the recognition accuracy and recognition time of the algorithms for clear underwater image targets. Among the five algorithms, our algorithm has the best average recognition accuracy, at 0.7315. Our algorithm also has the highest recognition accuracy for frogman and bubble targets, with 0.7717 and 0.7477, respectively. However, the algorithm in this paper is slower than MobileNet-SSD, with a recognition time of 0.208 s. CRSNet has the highest recognition accuracy for torpedoes and torpedo wakes, with 0.5641 and 0.6025, respectively. The DMNet algorithm has the highest accuracy in the recognition of submarines, at 0.9326. However, the average recognition accuracy of DMNet is lower than that of the algorithm in this paper, so the algorithm in this paper is more advantageous for the recognition of underwater targets. The Improved RetinaNet algorithm is higher than this paper’s algorithm in terms of AUV recognition accuracy, at 0.9470. In terms of recognition speed, the MobileNet-SSD algorithm works best, at 0.108 s. The analysis above shows that the algorithm in this paper is the best in terms of average recognition accuracy but less so in terms of torpedo and torpedo wake recognition accuracy. As can be seen from the first column on the left side of Fig. 5, the algorithm in this paper has no misses in the recognition of small-scale torpedo wakes and torpedoes. The algorithm in this paper is therefore well suited to the recognition of small targets underwater.
4.2. Blurred underwater image recognition results
Figure 6 shows the recognition results of the algorithms for underwater blurred images. The six columns in the figure represent the recognition accuracy for the six target types (torpedo, torpedo wake, submarine, frogman, bubble, and AUV) under different algorithms. Table II shows the target recognition accuracy and recognition time for underwater blurred images. The table shows that the algorithm in this paper has the best recognition performance on blurred images, with an average recognition accuracy of 0.7063. It retains the highest recognition accuracy for frogmen and bubbles, with 0.7588 and 0.7732, respectively. The algorithm in this paper is also the best at recognizing torpedoes, with 0.5149. CRSNet has the highest accuracy for torpedo wake recognition, at 0.6136. Improved RetinaNet has the highest accuracy for recognizing AUV and submarine targets, with 0.8420 and 0.9262, respectively. In terms of recognition time, MobileNet-SSD maintains the fastest recognition speed, at 0.115 s. The above data show that the algorithm in this paper has the highest mAP when recognizing underwater blurred targets.
4.3. Blurred underwater image recognition results under low light conditions
Figure 7 shows the recognition results of the algorithms for underwater blurred images in low light conditions. The six columns in the figure represent the recognition accuracy of the six target types (torpedo, torpedo wake, submarine, frogman, bubble, and AUV) under different algorithms. The figure shows that each algorithm achieves relatively high confidence levels when recognizing the torpedo, submarine, and AUV. In the second column of Fig. 7, CRSNet successfully recognizes the torpedo wake in low light conditions, which is excellent among the algorithms. However, CRSNet has poor recognition results for small-scale torpedoes. The algorithm in this paper has the highest confidence level for small-scale torpedo recognition, at 0.974. In the fourth column, CRSNet, Improved RetinaNet, and the algorithm in this paper recognize all the frogmen and bubbles. The algorithm in this paper has the highest confidence level for the recognition of frogman and bubble, with 0.992 and 0.998, respectively. From the above analysis, the algorithm in this paper achieves better results in recognizing blurred underwater images under low light conditions.
The bold font in the tables indicates the best value of each metric among the algorithms.
4.4. Experimental analysis
The experimental results are analyzed in terms of underwater blurred small target recognition. The algorithm in this paper has the highest average recognition accuracy among the compared algorithms. CRSNet has the longest recognition time, but its average recognition accuracy is second only to that of the algorithm in this paper, and its recognition results are also very positive. The average recognition accuracy and recognition time of DMNet and Improved RetinaNet for small underwater blurred targets are both lower than those of CRSNet, and the differences between DMNet and Improved RetinaNet in these two metrics are small. MobileNet-SSD has the best recognition speed, but its average recognition accuracy is lower.
5. Conclusions
Underwater images are blurred due to environmental and light disturbances, and the recognition of small underwater targets is challenging for AUVs. We propose an enhanced dilated convolution framework for underwater blurred target recognition. Firstly, the method extracts small target features through a hybrid dilated convolution feature extraction network, enlarging the receptive field of the algorithm without increasing its computational effort. Secondly, this paper learns the spatial semantic features through an adaptive correlation matrix to compensate for the missing features of the target. Finally, this paper fuses node features and spatial semantic features to achieve the recognition of small blurred targets. The average recognition accuracy of the algorithm in this paper is 1.04% better than existing methods.
Funding
This work was supported by the National Key R&D Program of China (2019YFB1311002) and the Science and Technology Project of Henan Province (212102210161, 222102320380, 222102110194, and 222102110205).