1. Introduction
The documentation of underwater cultural heritage is the basis for sustainable marine development, and underwater archaeology is of great significance for the transmission of history and culture and the preservation of underwater heritage. The purpose of scene parsing is to assign a class label to each pixel in an image, and effective scene parsing can provide valuable support for underwater archaeology.
In practice, classical FCN [Reference Long, Shelhamer and Darrell1] networks usually cannot achieve high-precision parsing of global and local information simultaneously. To improve the extraction of global context, Jie Jiang et al. [Reference Jiang, Liu, Fu, Zhu, Li and Lu2] proposed a novel global-guided selective context network (GSCNet) that selects contextual information adaptively. The multi-scale feature fusion and enhancement network (MFFENet) [Reference Zhou, Lin, Lei, Yu and Hwang3], the semantic consistency module [Reference Ma, Pang, Pan and Shao4], and attention residual block-embedded adversarial networks (AREANs) [Reference Yan, Wang, Bu, Yang and Li5] can fuse global semantic information at multiple scales. For the extraction of local features, Shiyu Liu et al. [Reference Liu, Zang, Li and Yang6] exploited the strong correlation between depth and semantic information by introducing a built-in deep semantic coupling coding module that adaptively fuses RGB and depth features. Junjie Jiang et al. [Reference Jiang, He, Zhang, Zhao and Tan7] proposed a graphical focus network that captures local object detectors and relational dependencies. Zhitong Xiong et al. [Reference Xiong, Yuan, Guo and Wang8] adopted a novel variational context deformable module that learns adaptive environmental changes in a structured way to improve the recognition of local information. However, existing methods cannot properly decouple the learning of spatial details from that of contextual information, and their parsing accuracy for blurred features is low.
To address the above issues, this paper proposes an underwater scene parsing method based on a global enhancement network, as shown in Fig. 1. Our main contributions can be summarized as follows:
1. The receptive field in standard convolution is fixed and can only capture a uniform feature scale, whereas underwater artifacts have different shapes and sizes. To extract features at multiple scales for the target classes in underwater archaeological scenes, we propose adaptive dilated convolution. It learns dilation coefficients adaptively according to the size of the objects in the scene and captures global features more flexibly and efficiently, so that features of various sizes in the scene can be handled effectively.
2. The image features of data collected in turbid water are blurred, and different categories are easily confused during classification. Traditional classification methods use a single classifier, which offers little discrimination between highly similar features. To improve category classification accuracy, this paper introduces multiple classifiers and uses several convolutional layers for enhanced classification. A difference-based regularization method is applied to feature categories with high similarity to enlarge the probability-score differences between categories and improve classification accuracy.
3. To verify the effectiveness and advancement of the proposed method for underwater archaeology problems, we built the Underwater Shipwreck Scenes (USS) dataset and compared the proposed algorithm with current advanced methods on it. To verify the generalization of the proposed algorithm, we also compare it with current state-of-the-art algorithms on general datasets (ADE20K and Cityscapes) and an underwater dataset (SUIM). The experimental results show that the proposed algorithm achieves good performance on these different datasets.
2. Related work
In scene parsing, significant results have recently been achieved for the problem of features missing due to occlusion of the target. Some approaches [Reference Nie, Han, Guo, Zheng, Chang and Zhang9, Reference Ji, Lu, Luo, Yin, Miao and Liu10] build overall contextual relationships on top of fully convolutional neural networks to capture more detailed features and thus improve parsing performance. The deep feature aggregation network (DFANet) [Reference Li, Xiong, Fan and Sun11] abstracts effective features by successively combining high-level semantic features with low-level detailed features, achieving both a large receptive field and detailed spatial features. DecoupleSegNets [Reference Li, Zhang, Cheng, Lin, Tan and Tong12] improves semantic segmentation performance by explicitly modeling the object body and its edges. The semantic-aware occlusion robust network [Reference Zhang, Yan, Xue, Hua and Wang13] uses the intrinsic relationship between the recognition target and the occluded part to infer the missing features. The semantic guidance and estimation network (SeGuE-Net) [Reference Liao, Xiao, Wang, Lin and Satoh14] can reconstruct and repair missing data. L. Cai [Reference Cai, Qin and Xu15] proposed an enhanced dilated convolution framework for underwater blurred target recognition. X. Qiao et al. [Reference Qiao, Zheng, Cao and Lau16] adopted a deep neural network that exploits contextual information to address missing attributes of individual objects. The context-based tandem network (CTNet) [Reference Li, Sun, Zhang and Tang17] can effectively combine global and local information to improve semantic segmentation performance. Semantic structure aware [Reference Sun and Li18] is a semi-supervised semantic segmentation algorithm based on semantic structure awareness: by exploiting the relationships between different semantic structures in the training data, weakly supervised information can be transformed into strongly supervised information, improving pixel-level dense prediction accuracy without increasing cost. Dilated convolution with learnable spacings (DCLS) [Reference Khalfaoui-Hassani, Pellegrini and Masquelier19] can increase the size of the receptive field without increasing the number of parameters; an interpolation technique is used to flexibly determine the spacings between non-zero elements, or equivalently their positions, which are learned by backpropagation. The adaptive fractional dilated convolution network (AFDC) [Reference Chen, Zhang, Zhou, Lei, Xu, Zheng and Fan20] is aspect-ratio-embedded, component-preserving, and parameter-free; it adaptively constructs fractional dilated kernels based on the aspect ratio of the image and interpolates between the two closest integer dilated kernels to solve the misalignment problem of fractional sampling. Adaptive dilated convolution (ADC) [Reference Luo, Wang, Huang, Wang, Tan and Zhou21] can generate and fuse multi-scale features of the same spatial size by setting different dilation rates for different channels, enabling it to adaptively adjust the fusion scale to better fit target classes of various sizes. However, the receptive field sizes of the above models are fixed, so only a uniform feature scale can be captured for objects of inconsistent size.
Deep-water environments are poorly lit and contain sediment, making it difficult to identify features accurately. To improve the semantic performance of degraded images, X. Niu et al. [Reference Niu, Yan, Tan and Wang22] proposed an effective image recovery framework based on generative adversarial networks. The underwater distorted target recognition network (UDTRNet) [Reference Cai, Chen and Chai23] and a method using a binary cross-entropy loss to extract abstract object features [Reference Cai, Chen, Sun and Chai24] improved the detection accuracy of underwater targets. Object-guided dual-adversarial contrast learning [Reference Liu, Jiang, Yang and Fan25] and a multi-scale fusion algorithm [Reference Rajan and Damodaran26] can effectively enhance seriously distorted underwater images. Zhi Wang et al. [Reference Wang, Zhang, Huang, Guo and Zeng27] proposed an adaptive global feature enhancement network (AGFE-Net) that uses multi-scale convolution with global receptive fields and attention mechanisms to obtain multi-scale semantic features and enhance the correlations between features. Asymmetric non-local neural networks for semantic segmentation (ANNNet) [Reference Zhu, Xu, Bai, Huang and Bai28] designed an asymmetric non-local approach to compute point-to-point similarity relationships efficiently and aggregate global information and context. W. Zhou et al. [Reference Zhou, Jin, Lei and Hwang29] proposed a common extraction and gate fusion network (CEGFNet) that captures high-level semantic features and low-level spatial details for scene parsing. SegFormer [Reference Xie, Wang, Yu, Anandkumar, Alvarez and Luo30] can output multi-scale features and aggregate information from different network layers. Y. Sun et al. [Reference Sun, Chen, He, Wang, Feng, Han, Ding, Cheng, Li and Wang31] proposed a model fine-tuning method based on singular value decomposition, which performs fast fine-tuning with a very small number of parameters and thus achieves segmentation with few samples. Z. Li et al. [Reference Li, Tang, Peng, Qi and Tang32] proposed a knowledge-guided approach for few-sample image recognition, in which an attention-based knowledge-guided model classifies few-sample images in the target domain by fusing and exploiting knowledge from multiple source domains. SUIM-Net, proposed by M. J. Islam et al. [Reference Islam, Edge, Xiao, Luo, Mehtaz, Morse, Enanand and Sattar33] for semantic segmentation of underwater imagery, improves segmentation performance using a fully convolutional encoder-decoder model. Several studies [Reference Wang and Yang34–Reference Liu and Song36] combined conditional random fields with deep CNNs, with significant improvements in segmentation accuracy and generalization performance. These scene parsing methods can effectively enhance image features, but they ignore the similarity of blurred features during classification, which easily leads to misclassification.
In this paper, we propose an adaptive dilated convolutional network with receptive fields of flexible size and shape to tackle objects of different sizes in the scene. We also propose an enhancement classification network that effectively improves the classification accuracy of blurred features. In addition, the proposed method effectively captures both global and local contextual information.
3. Our approach
The proposed method contains four parts, as shown in Fig. 1. The first part is the adaptive dilated convolution feature extraction model $P_{1}$, which extracts the global and local features of the scene. The second part is the contextual feature encoding model $P_{2}$, which learns the contextual relationships between the overall scene and regional features. The third part is the enhancement classification model, which uses an enhancement classifier to discriminate the correct classes of confusable objects. The fourth part is the scene parsing model, which performs classification based on the global and local features and the contextual relationship features to obtain the final scene parsing results.
3.1. Adaptive dilated convolution feature extraction model
Let the input features be $X^{(i)} \in \mathbb{R}^{H \times W \times D}$ , where $i=1, \ldots, N$ , N is the total number of scene elements in the dataset; H and W are the height and width of the image, respectively; D is the number of channels of the image. Firstly, we need to extract the features from the input elements $X^{(i)}$ . The feature extraction model is as follows:
where $\alpha ^{(i)}$ is a $D_{\alpha }$-dimensional feature vector that encodes the semantic information of the overall scene, $\left \{\beta _{j}^{(i)}\right \}_{j=1, \ldots, J}$ are $D_{\beta }$-dimensional feature vectors that encode local semantic information, J is the number of feature regions in the scene, and $\xi _{1}$ is the parameter of the feature extraction model $P_{1}$.
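In the notation above, the feature extraction model can be written as (a sketch of the mapping, not necessarily the authors' exact formulation):

$\left (\alpha ^{(i)}, \left \{\beta _{j}^{(i)}\right \}_{j=1, \ldots, J}\right ) = P_{1}\left (X^{(i)}; \xi _{1}\right )$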
In the process of underwater archaeology, underwater heritage is often buried by mud and sand, and some features are missing, which poses a great challenge for feature extraction. Extracting the overall scene features $\alpha ^{(i)}$ and local features $\left \{\beta _{j}^{(i)}\right \}_{j=1, \ldots, J}$ is a prerequisite for accurate scene parsing. Convolutional neural networks (CNNs) can extract high-level semantic features well. However, in standard convolution the receptive field is fixed, and features of objects with different sizes cannot be extracted simultaneously. In contrast, dilated convolution adjusts the size and shape of the convolution patch using learned dilation coefficients, giving a receptive field of flexible size and shape while extracting multi-scale feature information without loss of spatial resolution. To extract both overall and local features, we propose an adaptive dilated convolutional network. The proposed method removes two pooling layers from the convolutional neural network to obtain more detailed local features and higher-resolution feature maps. The dilation coefficients are adapted to the task: for small target classes that require detailed feature learning, a smaller dilation coefficient (less than 1) is used to learn finer local features $\left \{\beta _{j}^{(i)}\right \}_{j=1, \ldots, J}$; for large target classes that require global features, a larger dilation coefficient (greater than 1) is used to learn broader overall features $\alpha ^{(i)}$.
Figure 2 provides an overview of the proposed adaptive dilated convolution. Assume that the input and output features of the dilated convolution are $X^{(i)} \in \mathbb{R}^{H \times W \times D}$ and $F^{(i)} \in \mathbb{R}^{H \times W \times D}$, respectively. The dilation coefficients are learned by an additional regression layer that takes the feature map X as input and outputs a dilation coefficient map $\mathrm{R} \in \mathbb{R}^{H \times W \times 8}$. The size and shape of the receptive field are adjusted by the coefficient vectors in R. Since a dilation coefficient vector is predicted at every position, the coefficient map R has the same spatial size as the output feature map F. Thus, every convolution patch has its own dilation coefficient vector, yielding a receptive field of the target size and shape.
We use formula (2) to initialize the dilated regression layer:
where $\eta _{0}$ and $\mu _{0}$ denote the initial values of the convolution kernel and the bias of the regression layer, respectively, and $a$ denotes a position in the convolution kernel. The convolution kernel is set to values close to 0 and the bias to 1, so the generated dilation coefficients are initially close to 1. Training thus starts from standard convolution, and appropriate adaptive dilation coefficients are learned progressively during training.
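As an illustration of the regression layer and its initialization, the following PyTorch sketch (a minimal reconstruction, not the authors' released code; the class name, kernel size, and init_std value are assumptions) predicts an eight-channel dilation-coefficient map whose values start near 1:

```python
import torch
import torch.nn as nn

class DilationCoefficientRegressor(nn.Module):
    """Predicts an 8-channel dilation-coefficient map R from the feature map X.

    Weights are initialized close to zero and the bias to 1, so every
    coefficient starts near 1 and training begins from standard convolution,
    as described in the text.
    """
    def __init__(self, in_channels: int, init_std: float = 1e-4):
        super().__init__()
        self.regress = nn.Conv2d(in_channels, 8, kernel_size=3, padding=1)
        nn.init.normal_(self.regress.weight, mean=0.0, std=init_std)  # kernel ~ 0
        nn.init.constant_(self.regress.bias, 1.0)                     # bias = 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) -> R: (N, 8, H, W), one 8-d coefficient vector per position
        return self.regress(x)
```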
Reconstructing convolutional patches extracts a set of patches from the feature map, each with a certain shape and size. The number, size, and shape of the convolution patches are related to the size of the input feature map and the receptive field of the network, and they are adaptively reshaped based on the dilation coefficient map R. These patches are then reconstructed into a new feature map by interpolating between the patches to fill in the missing pixels. (The missing pixels refer to areas in the reconstructed feature map that may have gaps or pixel values that were not directly observed or sampled; they occur because the patches are extracted from the original feature map, and interpolation is used during reconstruction to fill the gaps between patches.) This approach allows the network to consider a larger context when performing convolution and increases the receptive field of the feature map. In traditional convolutional neural networks, reducing the size and resolution of the feature maps through operations such as pooling and strided convolution leads to loss of information and reduced resolution. To overcome this problem, bilinear sampling is used: feature vectors are sampled from the convolution patch by bilinear interpolation and multiplied element-wise with the kernel. We apply the adaptive dilated convolution to the last layer of the CNN. For each location (the red dot in the figure as an example), the associated convolutional patch is reshaped using the dilation coefficient vector learned by the regression layer to obtain a receptive field of flexible size and shape. When the dilation coefficient equals 1, the adaptive convolution is the standard convolution; when it is less than 1, the convolution patch shrinks and the receptive field shrinks; when it is greater than 1, the convolution patch expands and the receptive field is enlarged. Figure 3 illustrates receptive fields with flexible sizes learned by the dilated convolution. We performed experiments with different kernel sizes, and the results show that 1×1, 3×3, 5×5, and 7×7 convolution kernels perform best, whereas other sizes lose some ability to convolve feature information at different scales. Here, the kernels operate over different receptive fields rather than changing the kernel size by applying an expansion factor; they are of fixed size and capture feature information at different scales under different receptive fields, capturing local details, medium-scale structures, and larger-scale contextual information, respectively.
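The patch reconstruction and bilinear sampling can be illustrated with torch.nn.functional.grid_sample, which interpolates feature vectors at the fractional positions induced by the dilation coefficients. This is a simplified sketch under our own assumptions (one 3×3 patch per output location and a precomputed offset tensor), not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def sample_dilated_patch(x: torch.Tensor, centers: torch.Tensor,
                         offsets: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample a deformed 3x3 patch for every output location.

    x:       (N, C, H, W) input feature map
    centers: (N, H, W, 2) patch centers (s^t, v^t) in pixel coordinates
    offsets: (N, H, W, 9, 2) sampling offsets already scaled by the learned
             dilation coefficients (hypothetical layout)
    returns: (N, C, H, W, 9) sampled feature vectors O^t
    """
    n, c, h, w = x.shape
    coords = centers.unsqueeze(3) + offsets                  # (N, H, W, 9, 2)
    # Normalize to [-1, 1], the coordinate range expected by grid_sample.
    gx = coords[..., 0] / (w - 1) * 2 - 1
    gy = coords[..., 1] / (h - 1) * 2 - 1
    grid = torch.stack((gx, gy), dim=-1).view(n, h, w * 9, 2)
    sampled = F.grid_sample(x, grid, mode="bilinear", align_corners=True)
    return sampled.view(n, c, h, w, 9)
```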
Let $R^t$ be the 8-dimensional vector of adaptive dilation coefficients at position $t$, $R^{t}=\left [r_{x 1}^{t}, r_{x 2}^{t}, r_{x 3}^{t}, r_{x 4}^{t}, r_{y 1}^{t}, r_{y 2}^{t}, r_{y 3}^{t}, r_{y 4}^{t}\right ]^{T}$. Suppose that for $F^t$ the associated convolution patch in $X$ is $W^t$, an arbitrary convex quadrilateral centered at $\left (s^{t}, v^{t}\right )$, and that the convolution kernel is $K\in \mathbb{R}^{D \times (2k+1) \times (2k+1)}$, where $k$ is a constant. The shape of $W^t$ is determined by $R^t$ through the positions of its four corners: each corner is controlled by two components of $R^t$, and the values of all components of $R^t$ are determined by the corner positions. $(2k+1)\times (2k+1)$ feature vectors are selected from the convolution patch $W^t$ for element-wise multiplication. The coordinates of these feature vectors, $\left (x_{i j}, y_{i j}\right )$, are determined by the new corner positions and can be expressed as:
where $s^{t}$ , $v^{t}$ , $\varphi$ are integers, ${i}, {j} \in [-{k}, {k}]$ .
Since the dilation coefficients are real-valued, the feature vectors can be obtained by bilinear interpolation. Suppose the convolution patch after bilinear interpolation is $O^{t}$; it can be expressed as:
where $W_{n m}^{t}=W^{t}(n, m)$ and $n, m \in [-\varphi k, \varphi k]$. The forward propagation of the convolution is then:
where $K_{ij}=K(i, j)$ denotes the kernel element applied to all output channels by element-wise multiplication and $\mu$ is the bias.
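In this notation, the forward pass can be sketched as (our reconstruction, not necessarily the authors' exact formulation):

$F^{t}=\sum _{i=-k}^{k}\sum _{j=-k}^{k} K_{ij}\, O_{ij}^{t}+\mu$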
Accordingly, in the case of backpropagation, the gradient change is:
We obtain $\frac{\partial O_{i j}^{t}}{\partial W_{i j}^{t}}$ from the bilinear interpolation equation and then obtain $\frac{\partial O_{i j}^{t}}{\partial x_{i j}}$ and $\frac{\partial O_{i j}^{t}}{\partial y_{i j}}$ as the partial derivatives with respect to the corresponding coordinates. Since the coordinates $x_{i j}$ and $y_{i j}$ depend on the dilation coefficient vector $R^t$, we can use the following partial derivatives with respect to the dilation coefficients to obtain the gradient of $R^t$.
Finally, using these partial derivatives, the gradients of the dilation coefficient map $R$ and the input feature map $X$ are obtained by the chain rule.
where $r_{x_{-}}^{t}$ denotes any component of $\left \{r_{x 1}^{t}, r_{x 2}^{t}, r_{x 3}^{t}, r_{x 4}^{t}\right \}$ and $r_{y_{-}}^{t}$ denotes any component of $\left \{r_{y 1}^{t}, r_{y 2}^{t}, r_{y 3}^{t}, r_{y 4}^{t}\right \}$.
3.2. Context feature coding model
There are interconnections between scene features, and each feature is related to the overall scene; these associations aid feature classification and help generate more accurate scene parsing results. The contextual feature encoding model $P_{2}$ learns the contextual association features between the overall scene and local regions, as shown in equation (6):
where $\xi _{2}$ is the parameter of the contextual feature encoding model. $\left \{\gamma _{j}^{(i)}\right \}_{j=1, \ldots, J}$ are $D_{\gamma }$-dimensional contextual feature vectors. Each contextual feature $\gamma _{j}^{(i)}$ encodes the semantic relationship between the jth feature and the whole scene as well as its relationships with the other features.
A graph has an irregular structure and can be considered infinite-dimensional, so it can describe the semantic contextual relationships between feature points well; therefore, a graph convolutional network is used to learn the contextual relationships of the scene features. The matrix E represents the nodes of the scene feature graph, the adjacency matrix A represents the semantic contextual relationships between features, and E and A are the inputs of the graph convolutional network. The propagation between layers of the graph convolutional network can be expressed as:
where $\tilde{A}=A+I$, with I the identity matrix; $\tilde{D}$ is the degree matrix of $\tilde{A}$; $F^{k+1}$ is the feature of layer k + 1; the feature representation of the input layer is E; $f(\cdot )$ is the nonlinear activation function; and $\rho ^{k}$ is the learnable parameter of layer k. The final output of the graph convolutional network is the updated node features of $X^{(i)}$, which can be aggregated into a scene feature vector for inferring semantic relationships between features.
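The propagation rule described here matches the widely used graph-convolution layer. The PyTorch sketch below implements that standard rule under the assumption of symmetric normalization $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, and is not the authors' code:

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One graph-convolution layer following the propagation rule in the text.

    The symmetric normalization D^-1/2 (A + I) D^-1/2 is an assumption; the
    paper's exact normalization may differ.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.rho = nn.Linear(in_dim, out_dim, bias=False)  # learnable parameter rho^k
        self.act = nn.ReLU()                               # nonlinearity f(.)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # feats: (J, D) node features E; adj: (J, J) adjacency matrix A
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)   # A + I
        deg = a_tilde.sum(dim=1).clamp(min=1e-12)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        a_norm = d_inv_sqrt @ a_tilde @ d_inv_sqrt
        return self.act(a_norm @ self.rho(feats))
```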
3.3. Enhancement classification network
Let the input of the scene classification model $P_{3}$ be the scene feature $\alpha ^{(i)}$; the output scene classification probability prediction can be expressed as:
where $\xi _{3}$ is the parameter of the scene classification model, M is the total number of scene classes in the dataset, and $p_m^{(i)}$ is the initial probability that the image $X^{(i)}$ predicted by the scene classification model belongs to the mth scene class. The network includes a convolution layer with a $1\times 1$ filter and a softmax regression layer, which generate the scene classification results $\left \{p_{m}^{(i)}\right \}_{m=1, \ldots, M}$ based on the input global scene feature $\alpha ^{(i)}$.
The underwater heritage is located in turbid water, so the acquired object features are blurred and objects are easily confused during classification. The enhancement classification network is mainly used to discriminate between categories with similar probability values and determine the correct object category. The input of the enhancement classification network is the initial probability $p_m^{(i)}$, the image features F, and the original image X concatenated together, where the image features F serve as a reference during enhancement classification to improve its accuracy. Since the initial probability $p_m^{(i)}$, the image features F, and the original image X may have widely different value ranges, these inputs are normalized separately. The overall structure of the enhancement classification network is shown in Fig. 4.
Based on this input, a series of convolutional layers is used for enhanced classification. First, a 3$\times$3 convolution is used to extract contextual relationships between multiple pixels. We then introduce a multi-classifier and use a 1$\times$1 convolution to predict the enhanced classification $\alpha _{c}^{(i)} \in \mathbb{R}^{H \times W \times D}$. Each element of $\alpha _{c}^{(i)}$ represents the probability value that the ith pixel in the image belongs to the cth object class. This value is normalized by the multinomial logistic regression (i.e., softmax) function as follows:
where $p_{c}^{(i)}$ is the normalized probability. The set of all class probabilities $\left \{p_{c}^{(i)}\right \}_{c=1, \ldots, C}$ is the probability distribution $P(Y \mid X, c)$ sought by the scene semantic parsing task.
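A minimal sketch of the enhancement classification head as described (normalized and concatenated inputs, a 3×3 context convolution, then a 1×1 classifier followed by softmax); the hidden width, batch normalization, and the choice of per-channel L2 normalization are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancementClassifier(nn.Module):
    """Refines initial class probabilities using image features and the raw image.

    Sketch only: the hidden width, batch normalization, and the per-channel L2
    normalization of the inputs are assumptions, not values from the paper.
    """
    def __init__(self, num_classes: int, feat_channels: int, hidden_dim: int = 256):
        super().__init__()
        in_ch = num_classes + feat_channels + 3            # p_m, features F, image X
        self.context = nn.Sequential(                      # 3x3 conv gathers pixel context
            nn.Conv2d(in_ch, hidden_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(hidden_dim),
            nn.ReLU(inplace=True),
        )
        self.classify = nn.Conv2d(hidden_dim, num_classes, kernel_size=1)  # 1x1 classifier

    def forward(self, p_init, feats, image):
        # Inputs are assumed to share the same spatial size; each is normalized
        # separately before concatenation, as described in the text.
        inputs = [F.normalize(t, dim=1) for t in (p_init, feats, image)]
        logits = self.classify(self.context(torch.cat(inputs, dim=1)))
        return torch.softmax(logits, dim=1)                # per-pixel class probabilities
```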
In the model training phase, the scene classification loss is calculated after the scene classification results are generated. The loss of the enhancement classifier is the multi-class cross-entropy loss function:
where $y_{i, c} \in \{0,1\}$ is the scene classification label: $y_{i,c}=1$ means that the image $X^{(i)}$ belongs to the cth scene category, and $y_{i,c}=0$ means it does not. Since similar probability scores across multiple categories easily lead to misclassification, regularization is used to avoid this situation. The difference-based regularization method makes the differences between the category probability scores as large as possible. Here, the second-order moment is used as the difference-based regularizer:
where $Q \in [0,1-1/ C]$. The larger the differences between the probability scores $\left \{p_{c}^{(i)}\right \}_{c=1, \ldots, C}$, the smaller the value of Q. Thus, minimizing the loss function increases the differences between the probabilities.
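As an illustration, the sketch below combines the multi-class cross-entropy with a second-order-moment regularizer. The specific form $Q=1-\sum _{c}\big (p_{c}^{(i)}\big )^{2}$ and the weight reg_weight are assumptions, chosen only because they match the stated range $[0,1-1/C]$ and the stated behavior (larger gaps between scores give smaller Q):

```python
import torch
import torch.nn.functional as F

def enhanced_classification_loss(logits: torch.Tensor, labels: torch.Tensor,
                                 reg_weight: float = 0.1) -> torch.Tensor:
    """Cross-entropy plus a difference-based regularizer on the class probabilities.

    logits: (N, C, H, W) raw class scores; labels: (N, H, W) integer class ids.
    The regularizer form and reg_weight are assumptions for illustration.
    """
    ce = F.cross_entropy(logits, labels)
    probs = torch.softmax(logits, dim=1)
    # Q = 1 - sum_c p_c^2: 0 for one-hot predictions, 1 - 1/C for uniform ones.
    q = 1.0 - probs.pow(2).sum(dim=1)
    return ce + reg_weight * q.mean()
```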
3.4. Scene parsing model
To generate the scene semantic parsing results, this paper adopts the scene parsing model $P_{4}$ to predict the category of each region based on the local features and contextual relationship features, obtaining the final scene parsing results. Since there are two kinds of input features, local features and contextual features, the scene parsing model $P_{4}$ first generates two sets of classification probabilities from these features and then integrates them. To obtain more detailed predictions, $P_{4}$ predicts one set of classification probabilities from the pixel-level local features $\left \{\beta _{j^{\prime }}^{(i)}\right \}_{j^{\prime }=1, \ldots, J^{\prime }}$, where $J^{\prime }$ is the number of pixels in the image $X^{(i)}$, and another set from the superpixel-level contextual relationship features $\left \{\gamma _{j}^{(i)}\right \}_{j=1, \ldots, J}$, where J is the number of superpixels. After converting the probabilities predicted from the superpixel-level contextual features to pixel-level results, they are integrated with the probabilities predicted from the pixel-level features by taking, for each pixel and each category, the higher of the two probability scores as the final classification probability.
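The fusion rule above (taking the higher of the pixel-level and superpixel-level probabilities for each pixel and class) can be sketched as follows; the per-pixel superpixel index map used to broadcast superpixel probabilities to pixels is our own device for illustration:

```python
import torch

def fuse_probabilities(pixel_probs: torch.Tensor,
                       superpixel_probs: torch.Tensor,
                       superpixel_ids: torch.Tensor) -> torch.Tensor:
    """Element-wise max fusion of the two prediction streams.

    pixel_probs:      (K, H, W) probabilities from pixel-level local features
    superpixel_probs: (J, K)    probabilities from superpixel-level context features
    superpixel_ids:   (H, W)    long tensor, index of the superpixel of each pixel
    returns:          (K, H, W) fused per-pixel class probabilities
    """
    # Broadcast each superpixel's probabilities to all of its pixels.
    context_probs = superpixel_probs[superpixel_ids].permute(2, 0, 1)  # (K, H, W)
    return torch.maximum(pixel_probs, context_probs)
```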
The scene parsing model $P_{4}$ can be formulated as follows:
where $\xi _4$ is the parameter of the scene parsing model, K is the total number of object classes in the dataset, and $q_{j^{\prime }, k}$ is the probability, predicted by the scene parsing model, that the $j^{\prime }$th pixel in the image $X^{(i)}$ belongs to the kth object class. In the model training phase, the scene parsing loss constrains the learning of the model; it is a pixel-level cross-entropy loss function:
where $y_{i, j^{\prime }, k}$ is the label indicating whether the $j^{\prime }$th pixel belongs to the kth object class. Finally, a joint optimization approach combines the scene classification loss and the scene parsing loss, optimizing all the models proposed in this paper simultaneously:
where $\lambda$ is a scaling factor that controls the ratio of scene classification loss to scene parsing loss.
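In this notation, the joint objective can be sketched as $L=L_{\mathrm{parse}}+\lambda L_{\mathrm{cls}}$ (our reconstruction), where $L_{\mathrm{parse}}$ is the pixel-level cross-entropy parsing loss and $L_{\mathrm{cls}}$ is the enhanced classification loss.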
4. Experiment and analysis
4.1. Training dataset and evaluation index
In order to evaluate the scene parsing performance of the proposed method, the test results are compared with current state-of-the-art methods. All methods are tested on the Underwater Shipwreck Scenes (USS) dataset, which contains 1163 images of eight categories: wreck, statue, porcelain, sediment, reef, water, plant, and fish. This is a self-built dataset created by collecting underwater archaeology images online and annotating them manually. Pixel accuracy (PAcc) and mean intersection-over-union (mIoU) are used as the evaluation criteria. Pixel accuracy is the percentage of correctly classified pixels over the whole dataset. The mean intersection-over-union is obtained by calculating the intersection-over-union of each object class separately and then averaging the values over all object classes.
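For reference, PAcc and mIoU as defined here can be computed from a confusion matrix as in the following sketch (standard definitions, not code from the paper):

```python
import numpy as np

def pacc_and_miou(conf):
    """conf[k1, k2] counts pixels of ground-truth class k1 predicted as class k2."""
    tp = np.diag(conf).astype(float)
    pacc = tp.sum() / conf.sum()
    # Per-class IoU = TP / (TP + FP + FN); classes absent from the data are skipped.
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = tp[union > 0] / union[union > 0]
    return pacc, iou.mean()
```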
4.2. Experiment details
In this experiment, training and testing are performed on a small server with a GTX2080 GPU and 64 GB of RAM. To ensure objective comparison experiments, the PyTorch deep learning framework is used for implementation, and a ResNet convolutional neural network is used to construct the deep network. The training parameters are: initial learning rate 0.0001, 300 epochs, momentum 0.9, and weight decay 0.0001; stochastic gradient descent is used for training, and the learning rate is dynamically adjusted using the “poly” strategy.
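A minimal PyTorch sketch of the reported training configuration (SGD with momentum 0.9, weight decay 0.0001, initial learning rate 0.0001, and the “poly” schedule); the poly power of 0.9 and the function name are assumptions:

```python
import torch

def build_optimizer_and_scheduler(model: torch.nn.Module, total_iters: int):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                                momentum=0.9, weight_decay=1e-4)
    # "poly" schedule: lr = base_lr * (1 - iter / total_iters) ** power
    poly = lambda it: (1 - it / total_iters) ** 0.9   # power 0.9 is an assumption
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)
    return optimizer, scheduler
```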
4.3. Ablation experiment
We validate the effectiveness of the adaptive dilated convolution (ADC) and the enhancement classification network (ECN) in the proposed algorithm through ablation experiments conducted on the USS dataset. We use ResNet-50 as the baseline for the global enhancement network (GENet) ablation experiments. The experimental results are shown in Table I, which includes the recognition accuracy of each category in addition to PAcc and mIoU; the visual results are shown in Fig. 5. From Table I, it can be seen that the baseline network reaches a PAcc of 79.6% and an mIoU of 63.8%. After adding ADC to the baseline, PAcc and mIoU improve to 82.3% and 65.1%, respectively, and some categories such as “statue,” “fish,” and “plant” far exceed the baseline, indicating that ADC helps to refine the details of small objects. After adding ECN to the baseline, PAcc and mIoU are again higher than the baseline, at 83.1% and 65.5%, respectively; the recognition accuracy of individual categories is higher than the baseline, especially “wreck,” “reef,” “sediment,” and “water.” When ADC and ECN are added to the baseline together, the results improve significantly, with PAcc and mIoU reaching 86.9% and 68%, increases of 7.3% and 4.2% over the baseline and clearly better than adding either component alone. This is because the adaptive dilated convolution can change the receptive field according to the object size to extract more features, while the enhancement classification network increases the distance between similar categories and makes the classification results more accurate. Their combination not only improves the evaluation metrics but also enables the parsing of more complex scenes. The experimental results demonstrate the effectiveness of ADC and ECN.
The bold font in the table indicates the best value for each metric.
From the figure, it can be seen that the baseline network is less effective at recognizing details and small objects (e.g., fish). After adding the ADC network, the overall recognition is significantly improved; in the second image of the second column, the number of recognized fish increases significantly and their contours are clearer, indicating a better recognition rate for small objects. After adding the ECN network, the boundary contours of different objects in the result images are clear and recognition improves. Adding both the ADC and ECN networks significantly improves the overall visual quality. The visual results illustrate that both the ADC and ECN networks proposed in this paper can effectively improve the algorithm’s performance.
4.4. Comparison of different dilated convolutional networks
Dilated convolution was proposed long ago [Reference Yu and Koltun37], and many subsequent studies have improved upon it. To verify the advancement of the proposed adaptive dilated convolution, it is replaced with current state-of-the-art dilated convolutional networks for comparison experiments. The experimental results are shown in Table II, and the visual results are shown in Fig. 6.
The bold font in the table indicates the best value for each metric.
As can be seen in Table II, the adaptive dilated convolution proposed in this paper performs better overall, with the highest PAcc and mIoU, and its per-category recognition accuracy is the best overall. However, other methods achieve the highest recognition accuracy for individual categories. DCLS achieves the highest accuracy for porcelain at 82.3%, which is 0.3% higher than the dilated convolution proposed in this paper. AFDC achieves the highest pixel accuracy for wreck at 88.1%, 0.4% higher than the proposed dilated convolution. ADC achieves the highest recognition accuracy for water at 86.0%, 0.1% higher than the proposed algorithm. All the above methods obtain more contextual information by enlarging the receptive field, but their overall recognition accuracy in complex scenes containing both large and small objects is slightly lower than that of the proposed network. The proposed adaptive dilated convolution can change the size and shape of the receptive field according to the size and shape of different target categories, thereby achieving better PAcc and mIoU on the USS dataset.
As can be seen in Fig. 6, the visual results of the different dilated convolutions are broadly similar across object classes. However, the adaptive dilated convolution proposed in this paper outperforms the other dilated convolutions in the overall object contours and in the recognition accuracy of small objects. The visual results show that DCLS performs poorly on low-resolution images, because a learnable spatial sampling rate can result in small effective kernel sizes that do not capture sufficient contextual information, which affects segmentation performance. Although AFDC can adapt to different object sizes, its performance is not as good as the other networks when dealing with dense small objects. As can be seen from the third image in the first row, when the target in the scene is partially occluded or deformed, the prediction performance of ADC is significantly lower than that of the other methods.
4.5. Experimental results and analysis under different conditions
4.5.1. The parsing results under the general condition
To evaluate the performance of the proposed algorithm in regular scenes, it is compared with existing advanced methods. The visual results are shown in Fig. 7, the class pixel accuracy and PAcc are shown in Table III, and the class intersection-over-union and mIoU are shown in Table IV.
The bold font in the table indicates the best value for each metric.
The bold font in the table indicates the best value for each metric.
From the experimental results in Table III and Table IV, we can see that the PAcc and mIoU of the proposed algorithm are higher than those of the other algorithms, at 86.9% and 68%, respectively. The pixel accuracy and IoU of all eight categories are improved compared with DFANet. Although DFANet enlarges the receptive field, the sizes of object categories in underwater shipwreck scenes are inconsistent and vary greatly, so it has difficulty recognizing the different object categories accurately; in particular, for the small objects porcelain and fish, its recognition rates are the worst among all methods, at 80.5% and 73.6%. DecoupleSegNets uses a feature-pyramid-based approach to process input images at different scales but is still sensitive to the input image size; its accuracy for porcelain is 0.7% higher than the proposed algorithm, while all other categories are lower. The attention mechanism and non-local structure of ANNNet make it somewhat of a black box, with poor interpretability for some categories. SegFormer places certain restrictions on the size of the input image, which needs to be adjusted within a certain range. In contrast, the adaptive dilated convolution proposed in this paper can automatically adjust the size of the receptive field according to the object category and fully extract multi-scale features of different target classes. The category pixel accuracy and IoU of the proposed method are mostly higher than those of the compared algorithms, and the proposed algorithm has the highest recognition accuracy for statue, reef, sediment, and fish under general conditions, significantly higher than the other algorithms. These comparative results show that adequate feature learning, the construction of contextual semantic relations, and effective enhanced classification can significantly improve the performance of the proposed method on the USS dataset.
From the visual results, it can be seen that dense and relatively small objects (fish) are easily recognized as a single object, and boundary recognition is blurred where different objects intersect. There are also misclassifications in the figure; for example, “reef” is recognized as “wreck.” Overall, the recognition of different object categories and their boundary contours by the proposed method is relatively good, which shows that its performance on the USS dataset is better than that of the other algorithms.
4.5.2. The parsing results under the semi-burial situation of cultural relics
To verify the effectiveness of the proposed algorithm under different situations, the visual results in the case of semi-buried cultural relics are compared with those of other methods, as shown in Fig. 8. Class pixel accuracy and PAcc are shown in Table V, and class intersection-over-union and mIoU are shown in Table VI.
From the experimental results in Table V and Table VI, it can be seen that the PAcc and mIoU of the proposed algorithm are higher than those of the other algorithms in the case of semi-buried cultural relics, at 85.4% and 65.4%, respectively, and most target categories of the proposed algorithm score higher than the other algorithms. For individual categories, the proposed algorithm is lower than DFANet by 1.2% in pixel accuracy and 0.6% in IoU, the pixel accuracy and IoU for fish are lower than ANNNet by 0.6% and 0.3%, and the pixel accuracy and IoU for plant are lower than DecoupleSegNets by 1% and 0.4%. However, the pixel accuracy and IoU of the semi-buried artifacts (wreck, statue, porcelain) are significantly higher than those of the other algorithms. In the case of semi-buried artifacts, adaptive dilated convolution can use the dilation coefficients of the convolution kernel to expand the receptive field, which improves the feature extraction ability for the target classes. At the same time, adaptive dilated convolution can also use the size and shape of the convolution kernel for feature enhancement and compensation to improve the representation and recognition of the target categories. The comparison results show that adjusting the receptive field size according to the object size can fully extract object features, fuse contextual information, and improve the segmentation accuracy of object categories.
It can be seen from the visual results that the proposed algorithm achieves the most complete recognition of the semi-buried cultural relics. In the result images of the other methods, plants or sediment covering the buried relics are recognized as part of the relics, resulting in blurred edges of the object classes, and some smaller object categories cannot be recognized at all. Overall, the visual results show that the proposed method recognizes artifacts with high completeness and relatively good edges, which further demonstrates the effectiveness of the proposed algorithm.
The bold font in the table indicates the best value for each metric.
4.5.3. The parsing results under the turbid water condition
Due to the presence of a large amount of sediment at the sites where the cultural relics are located, the surrounding water is turbid. To further verify the effectiveness of the proposed algorithm, it is compared with other algorithms under turbid water conditions: the visual results are shown in Fig. 9, the class pixel accuracy and PAcc are shown in Table VII, and the class intersection-over-union and mIoU are shown in Table VIII.
The bold font in the table indicates the best value for each metric.
The bold font in the table indicates the best value for each metric.
The bold font in the table indicates the best value for each metric.
From the experimental results in Table VII and Table VIII, we can see that the PAcc and mIoU of the proposed algorithm are higher than those of the other algorithms, at 84.9% and 65.5%, respectively. Meanwhile, the category pixel accuracy and IoU of all eight categories are improved and are higher than those of the other algorithms. Compared with DFANet, the proposed algorithm improves pixel accuracy by 3.4% and mIoU by 1.3%. Compared with DecoupleSegNets, the pixel accuracy of the proposed algorithm is 3% higher and the mIoU 1.2% higher. The pixel accuracy and mIoU of ANNNet are 2.9% and 1% lower than those of the proposed algorithm, and compared with SegFormer, the proposed algorithm improves pixel accuracy by 2% and mIoU by 0.8%. When there are ambiguous, difficult-to-judge samples, a single classifier may not classify them accurately because they may belong to multiple classes or to none. Stepwise classification using multiple classifiers handles these ambiguous samples better: each classifier is responsible for a set of categories; the initial classifier classifies all samples, and misclassified or ambiguous samples are sent to the next classifier for further classification. This process can be repeated several times until all samples are assigned by the last classifier for final classification. The experimental results show that the proposed algorithm can effectively improve the segmentation performance of object categories under turbid water conditions, which verifies the effectiveness of the proposed network.
As can be seen from the visual results, all the category boundaries are blurred; the turbidity of the water blurs the extracted features, which makes discrimination difficult during classification. The figure also shows object category misclassification: because the water is turbid, features are difficult to extract and some objects are identified as similar background classes, for example, “porcelain” is identified as “sediment,” and “statue” is identified as “reef.” The last row of the results shows that our proposed method has good visual quality, which demonstrates its effectiveness in turbid water conditions.
4.6. Comparative experiments on public datasets
To verify the generalization of the algorithm proposed in this paper, we conducted comparison experiments with current state-of-the-art algorithms on public datasets, namely the ADE20K dataset, the Cityscapes dataset, and the SUIM dataset. ADE20K covers a wide variety of annotations of scenes, objects, and object parts, containing diverse objects in natural spatial environments; each image has an average of 19.5 instances and 10.5 object classes, with 150 semantic categories. The Cityscapes dataset contains finely annotated city street scenes collected from 50 different cities across different periods and seasons, covering 30 categories, with 2975 training images, 500 validation images, and 1525 test images. The SUIM dataset is the first publicly available large-scale underwater semantic segmentation dataset; it contains 1525 annotated training images and 110 test images and covers 8 categories: Background (waterbody) (BW), Human divers (HD), Aquatic plants and sea-grass (PF), Wrecks or ruins (WR), Robots (RO), Reefs and invertebrates (RI), Fish and vertebrates (FV), and Sea-floor and rocks (SR).
4.6.1. ADE20K and Cityscapes comparison test
Since the experimental results on the ADE20K and Cityscapes datasets are similar, we analyze the two sets of results together. Due to the large number of scene categories in these two datasets, only PAcc and mIoU are reported in this paper; the experimental data are shown in Tables IX and X, and the visual scene parsing results are shown in Figs. 10 and 11. The data in Tables IX and X show that the proposed algorithm achieves the highest PAcc and mIoU on both the ADE20K dataset and the Cityscapes dataset. DFANet does not process object edge details finely enough, resulting in lower scores and less accurate visual parsing results. The attention mechanism in ANNNet is based on a homogeneous image grid, which is less effective for resolving irregular shapes in the scene. SegFormer performs differently on input images of different sizes. The overall visual parsing results of the proposed algorithm and the current state-of-the-art algorithms are similar, but the proposed algorithm performs better in the details: through the adaptive dilated convolution, it makes consistent predictions for large objects and accurate predictions for small objects.
The bold font in the table indicates the best value for each metric.
4.6.2. SUIM dataset comparison test
To further verify the advancement and performance of the proposed algorithm on underwater images, we conducted comparison experiments between current advanced algorithms and the proposed algorithm on the SUIM dataset. In addition, we added the SUIM-Net method to this set of comparison experiments. The experimental data are shown in Table XI, and the visual parsing results are shown in Fig. 12.
The bold font in the table indicates the best value for each metric.
The data in the table show that the proposed algorithm has the highest PAcc and mIoU among the compared methods, at 93.1% and 81.1%, respectively. In terms of category pixel accuracy, the proposed algorithm is slightly lower than other algorithms for some categories: SUIM-Net has the highest pixel accuracy for BW and FV, 0.2% and 0.8% higher than the proposed algorithm; DecoupleSegNets has the best recognition for HD at 90.2%, 0.3% higher than the proposed algorithm; and ANNNet has the highest recognition accuracy for WR at 86.3%, 0.5% higher than the proposed algorithm. As can be seen from the visual results, the parsing results on the SUIM dataset are clearer than those on the USS dataset. This is because the SUIM dataset is larger and provides more samples to train the model, thus better capturing the statistical patterns of the data. In addition, the SUIM dataset contains many high-quality underwater images with clearer category features than the USS dataset; higher-clarity images are easier to classify during training because object boundaries and details are clearer.
The bold font in the table indicates the best value for each metric.
5. Conclusions
In this paper, we propose a GENet scene parsing method for underwater archaeological scenarios to solve the problems encountered in such scenes. Adaptive dilated convolution is proposed to obtain receptive fields of flexible size and shape for scene parsing; it reshapes the relevant receptive fields by learning a vector of dilation coefficients to adaptively change the positions and shapes of the convolution patches. In addition, we propose an enhancement classification network, which optimizes the classifier with a difference-based regularization method and can effectively distinguish confusable categories. Besides, a self-made USS dataset is developed for underwater archaeological scenes, and extensive experiments are conducted on it to verify the effectiveness of the proposed method. The effectiveness of the proposed network is demonstrated by ablation experiments, and the comparison with current state-of-the-art dilated convolutional networks shows that the adaptive dilated convolution has excellent performance. Finally, we compare with current state-of-the-art methods in three different cases, and both PAcc and mIoU are higher than those of the other algorithms, proving the superiority of our algorithm. To verify the generalization of the proposed algorithm, we conducted comparison experiments on the public datasets ADE20K and Cityscapes and the underwater dataset SUIM; the results show that the proposed algorithm performs well on the public datasets and outperforms the other algorithms. In these comparative experiments, the results of the same algorithms on the public datasets are significantly higher than on the homemade dataset in this paper, because the public datasets have diverse scenes and complex, changeable categories. The adaptive dilated convolution proposed in this paper, whose receptive field can vary as an arbitrary convex quadrilateral, is suitable for images with fairly regular structure, but its effect is not satisfactory for complex image structures and irregular edges. In addition, the USS dataset proposed in this paper has certain limitations: it contains few scene and object categories and a small amount of data, so the model cannot be fully trained. Our next step will be to optimize the performance of the proposed algorithm and expand the size of the dataset to improve the diversity of the scene data, so that the proposed method and the USS dataset can be applied more widely.
Authors’ contributions
Junyan Pan wrote the article. Jishen Jia reviewed and edited this article. Lei Cai conceived and designed the study.
Financial support
This work was supported by the Science and Technology Project of Henan Province (grant number 222102110194, 222102320380, 232102320338).
Competing interests
The authors declare that there are no competing interests regarding the publication of this article.
Ethical considerations
None.