1. Introduction
In recent years, deep learning has been widely used in various fields, including autonomous driving [1,2] and computer vision [3,4,5,6]. Object detection [7] and semantic segmentation [8,9,10] are currently the most prevalent algorithms in the field of autonomous driving. However, each task has its own shortcomings: object detection can only identify the location of traffic objects and output a bounding box with location information; it cannot describe their shapes in detail. Semantic segmentation, in turn, achieves pixel-level segmentation but cannot differentiate between individual instances, and segmenting every category, including buildings, natural scenery, and others, is excessively redundant in the context of autonomous driving.
Therefore, we choose instance segmentation [11] as the juncture between semantic segmentation and object detection. Instance segmentation not only identifies the location of objects of interest but also segments each object pixel by pixel within an image. Consequently, applying an instance segmentation algorithm to autonomous driving scenes can distinguish pedestrians, cars, riders, and other objects of interest at both the instance and pixel levels, which has engineering significance. An instance segmentation network outputs both the bounding box of an object, which gives its location, and the segmentation result, which describes its specific shape. Therefore, we adopt an instance segmentation network to study the missed detection problem.
The complexity of autonomous driving scenarios increases the likelihood of missed detections. Failure to recognize and segment traffic objects such as cars and pedestrians while driving can significantly increase the likelihood of life-threatening traffic accidents. Therefore, it is imperative to address the issue of missed detection in order to enhance both the safety and the efficiency of traffic.
Existing methods for instance segmentation concentrate on developing new paradigms while ignoring the problem of missed detection. Faster R-CNN [7] is the first two-stage object detection algorithm: the bounding boxes are initially screened through the Region Proposal Network (RPN) [7]; then, the network head and the NMS post-processing strategy screen them a second time, greatly reducing the number of bounding boxes. Mask R-CNN [12] extends Faster R-CNN with a mask head and is therefore the first two-stage instance segmentation network. During inference, the bounding boxes of the detection head are first obtained, then aligned to obtain the mask RoI features, from which the instance masks are finally produced. Since the two-stage instance segmentation algorithm relies heavily on the bounding boxes of the detection head, and their number is not guaranteed, it suffers from missed detection. Among single-stage instance segmentation networks, SOLO [13] is typical. SOLO directly predicts instance masks by introducing the concept of “instance categories”, assigning an “instance category” to each pixel within an instance according to the instance’s location and size. SOLOv2 [14] was then proposed, wherein the representation and learning of the mask are obtained by convolving dynamically generated kernel parameters with the feature maps. Single-stage methods such as SOLO and SOLOv2 share the same principle: a single location corresponds to a single instance mask. This strikes a good balance between precision and speed, but the principle itself is a source of missed detection: when traffic objects are too close together, the network maps their centers to the same location, so only one ambiguous instance mask is predicted and the other objects are missed. Multi-stage methods can be divided into two types based on whether the attention mechanism [15] is used. The first type comprises traditional methods that do not use the attention mechanism, such as HTC [16] and Cascade R-CNN [17]. These algorithms extend the two-stage paradigm: after each stage the bounding boxes are filtered again and become very few in number, so they are not suitable for solving the missed detection problem. Owing to these inherent characteristics, the aforementioned traditional instance segmentation methods result in missed detection; this cannot be altered through structural optimization, so they cannot serve as a baseline for solving the problem. Therefore, we shift our focus to attention-based [15] instance segmentation.
Query-based instance segmentation builds on the attention mechanism [15] and treats instances of interest as learnable queries. We select QueryInst [18] as our baseline for addressing the missed detection problem. QueryInst is the first query-based instance segmentation method. It exploits the one-to-one relationship between object queries and mask Region of Interest (RoI) features, assigning the instance information (shape, location, associative embedding, etc.) contained in a query to the RoI features via dynamic mask heads. During training, QueryInst employs a parallel supervision strategy, wherein gradient computation and parameter updates are performed simultaneously across the stages of the network without mutual interference.
We visualized the largest-scale feature layer of the Feature Pyramid Network (FPN) [19] with Class Activation Mapping (CAM) [20] in order to analyze in detail which factors cause missed detection. As depicted in Figure 1, we compared the original images, the visualized heat maps, and the final instance segmentation results. The objects circled in red in the third column were not detected or segmented. According to the heat maps, the position of a recognized instance has a maximum heat value, whereas the position of a missed instance has a lower heat value or even none. This indicates that the network has not fully learned the features of the missed instance: the similarity between missed instance features and background features causes the former to be ignored. Therefore, strengthening the network’s extraction of missed instance features at different levels and completing those features are essential to resolving the missed detection problem.
Therefore, we propose CompleteInst, a multi-stage, efficient perception algorithm based on the attention mechanism that solves the missed detection problem by strengthening the learning of missed instance features. Our method addresses missed detection at both the feature and instance levels. At the feature level, we propose Global Pyramid Networks (GPN) and the semantic branch. GPN collects the global information of all instances in each feature layer, strengthens the connection between missed instances and other instances, and completes the global features of missed instances. The semantic branch enables the missed instances, in the form of RoIs, to obtain semantic information, enhances the distinction between missed instances and the background, and completes the missed instances’ semantic features. At the instance level, we propose query-based optimal transport assignment (OTA-Query) by generalizing traditional OTA [21]. OTA-Query augments the number of positive samples for each missed instance, allowing more high-quality positive sample features to participate in the regression of masks and bounding boxes and thereby indirectly completing the missed instance features at the instance level.
Notably, both our semantic branch and OTA-Query are parallel, which is compatible with QueryInst’s parallel supervision mechanism. These methods do not interfere with one another between stages, and they all occur within a single stage. To highlight the parallelism of our structures, we also introduce non-parallel structures as a contrast, utilizing the outcomes of the preceding stage, such as the dynamic interactive module, the enhanced RoI features, etc. Due to their violation of the parallel supervision mechanism, these non-parallel structures did not achieve good results.
CompleteInst is a query-based instance segmentation method developed by our team, and the primary objective of all its improvements is to address the issue of missed detection. Evaluations conducted on the Cityscapes and COCO datasets demonstrate that our model outperforms advanced existing algorithms. The primary contributions of this study are outlined below:
- (1) To strengthen the connection between missed instances and other instances, we improve FPN and propose GPN. GPN completes the global features of missed instances.
- (2) To enhance the distinction between missed instances and the background, the semantic branch is proposed. The semantic branch completes the semantic features of missed instances.
- (3) To improve the quality of positive samples for each missed instance, OTA-Query is proposed.
- (4) To highlight the parallelism of the above structures, non-parallel structures are introduced for comparison. The parallel structures are shown to be more effective.
The remainder of the paper is structured as follows: Section 2 presents related research on deep learning-based instance segmentation methods. Section 3 describes the flow and operating principle of the proposed CompleteInst method. Section 4 analyses the experimental results obtained by the proposed CompleteInst algorithm. Section 5 provides the principal conclusions.
3. Methodology
We propose the CompleteInst network based on QueryInst to solve the missed detection problem in autonomous driving scenarios. As Figure 2 shows, the entire network consists of three parts: the backbone, GPN, and the head. The head is mainly composed of the semantic branches ($S_{mask}$ and $S_{box}$), Dynamic Convolution (DynConv), Multi-head Self-Attention (MSA), and OTA-Query modules. To fully interact with the query, the head is iterated over six stages. In the following sections, we introduce the above modules in greater depth.
3.1. Overview of CompleteInst
The CompleteInst structure is depicted in Figure 2. First, after the backbone and GPN extract and aggregate features, the feature layers $x_{GPN}$ with four different scales are obtained. Next, $x_{GPN}$ is fed to the head. The head consists of two parts: the detection head, which outputs the detection result, and the segmentation head, which outputs the segmentation result. The detection head pipeline consists of the following steps:

$$x_t^{box} = S_{box}(x_{GPN}, b_{t-1}), \quad q_t^{*} = MSA_t(q_{t-1}), \quad (x_t^{box*}, q_t) = DynConv_t^{box}(x_t^{box}, q_t^{*}), \quad b_t = \mathcal{B}_t(x_t^{box*})$$
As described by the above formulas, the bounding box $b_{t-1}$ and $x_{GPN}$ are entered into the semantic branch $S_{box}$ to obtain the box RoI features $x_t^{box}$. The semantic branch enables missed instances to obtain semantic features and completes their semantic information. At the same time, the query $q_{t-1}$ input to stage $t$ is processed by the multi-head self-attention mechanism ($MSA_t$) to obtain the transformed query $q_t^{*}$. $MSA_t$ correlates the query features with each other and removes the isolation between queries. Then, the dynamic convolution head $DynConv_t^{box}$ links $x_t^{box}$ and $q_t^{*}$: it decodes the RoI features through matrix multiplication and assigns the instance information in the query features to the RoI features. This yields the enhanced box RoI features $x_t^{box*}$ and the query $q_t$ that is input to the next stage. $x_t^{box*}$ is regressed through the detection branch $\mathcal{B}_t$ to obtain the detection result $b_t$, which is used both as the input of the next stage and by the mask branch of the current stage.
The segmentation head pipeline is as follows:

$$x_t^{mask} = S_{mask}(x_{GPN}, b_t), \quad x_t^{mask*} = DynConv_t^{mask}(x_t^{mask}, q_t), \quad m_t = \mathcal{M}_t(x_t^{mask*})$$
The segmentation pipeline is similar: $b_t$ and $x_{GPN}$ are entered into the semantic branch $S_{mask}$ to obtain the mask RoI features $x_t^{mask}$. The semantic branch enables missed instances to obtain semantic features and completes their semantic information. The dynamic convolution head $DynConv_t^{mask}$ dynamically interacts $x_t^{mask}$ and $q_t$ through matrix multiplication to obtain the enhanced mask RoI features $x_t^{mask*}$. Finally, $x_t^{mask*}$ is input to the mask branch $\mathcal{M}_t$, and the stage-$t$ instance mask $m_t$ is obtained through full convolution. The query $q_t$ and the bounding box $b_t$ are fed to the next stage. The above process is iterated six times, progressively refining the bounding boxes and masks.
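To make the per-stage interaction concrete, the following is a minimal PyTorch sketch of a QueryInst-style head stage as described above. It is a sketch under our own simplifying assumptions: the class name, the single-linear kernel generator, and the mean-pool query update are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DynamicConvHead(nn.Module):
    """Minimal sketch of one stage's query-RoI interaction (QueryInst-style).
    Illustrative simplification, not the exact network."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_params = nn.Linear(dim, dim * dim)  # query -> per-instance kernel
        self.norm = nn.LayerNorm(dim)

    def forward(self, roi_feats: torch.Tensor, queries: torch.Tensor):
        # roi_feats: (N, dim, S, S) aligned RoI features; queries: (1, N, dim)
        q_star, _ = self.msa(queries, queries, queries)   # MSA: correlate the N queries
        kernel = self.to_params(q_star.squeeze(0))        # (N, dim*dim)
        n, d, s, _ = roi_feats.shape
        kernel = kernel.view(n, d, d)
        feats = roi_feats.flatten(2).transpose(1, 2)      # (N, S*S, dim)
        # dynamic convolution = matrix multiplication of RoI features with the
        # query-generated kernel, assigning instance information to the RoI
        enhanced = self.norm(torch.bmm(feats, kernel))    # (N, S*S, dim)
        q_next = self.norm(q_star.squeeze(0) + enhanced.mean(dim=1))
        return enhanced.transpose(1, 2).reshape(n, d, s, s), q_next.unsqueeze(0)
```

The enhanced RoI features would then feed the box or mask branch of the current stage, while the updated query is passed to the next stage, mirroring the two pipelines above.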
In the sample assignment stage, we adopt the OTA-Query sample allocation strategy: the allocation with the smallest total cost corresponds to the Wasserstein distance, the Sinkhorn algorithm [36] is applied to solve it, and the result is an assignment in which one label corresponds to multiple positive samples.
Throughout the whole pipeline, the GPN, the semantic branch, and OTA-Query play important roles in solving the missed detection problem. In the following sections, the implementation and function of each core component of CompleteInst are introduced in detail.
3.2. Global Pyramid Networks
We propose GPN because the features of missed instances exist in the FPN feature layers. The FPN structure performs multi-scale feature fusion through up-sampling and the addition of corresponding position elements. However, this fusion method has a limitation: it only considers the correspondence between an instance’s own features at different scales while ignoring the relationship with other, more distant instance features at the same scale. In addition, the feature layers are obtained by stacking 3 × 3 convolutions in the depth direction, whose receptive field is also a local window, so insufficient global information is collected. On the basis of these two limitations, we propose GPN with GCN convolution to enhance the acquisition of global features of missed instances and the connection between different instance features at the same scale.
Our GPN is depicted in Figure 3a, where the original FPN is contained within the dashed line. We connect the GCN convolution after the original FPN outputs in order to collect global information for missed instances. Finally, the feature layers G2–G5 containing global information are obtained and serve as inputs for the following six stages. Figure 3b depicts the specific implementation of each step of the GCN convolution.
First, global modelling is performed on the feature maps. For each query position $j$ in the feature map, an attention weight $\alpha_j$ is obtained through a 1 × 1 convolution followed by SoftMax normalization. The weight is multiplied by the feature $x_j$ at position $j$, and the products are summed over all query positions (a matrix multiplication) to obtain the corresponding global feature $z = \sum_j \alpha_j x_j$. In order to reduce the computation caused by the larger channel numbers of deeper layers and to capture channel information, we place a 1 × 1 convolution in the channel modelling step so that the number of channels after the convolution is C/r, where r is the bottleneck ratio; the selection and comparison of the r value are conducted in the experimental section. The obtained global features thereby acquire channel dependencies. Finally, the broadcast mechanism adds the global feature to every element location, completing the global features.
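As an illustration, below is a minimal PyTorch sketch of such a global context block in the GCNet style that the description suggests; the class name GCBlock and the exact normalization placement inside the bottleneck are our own assumptions.

```python
import torch
import torch.nn as nn

class GCBlock(nn.Module):
    """Sketch of the GCN (global context) convolution described above."""
    def __init__(self, channels: int, ratio: float = 0.25):
        super().__init__()
        hidden = max(1, int(channels * ratio))            # bottleneck: C/r channels
        self.attn = nn.Conv2d(channels, 1, kernel_size=1) # per-position attention logits
        self.channel_mlp = nn.Sequential(                 # channel modelling
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # SoftMax-normalized attention weights over all H*W positions
        w_attn = self.attn(x).view(b, 1, h * w).softmax(dim=-1)
        feats = x.view(b, c, h * w)
        # weighted sum over positions (matrix multiplication) -> global feature z
        z = torch.einsum("bcn,bon->bco", feats, w_attn).reshape(b, c, 1, 1)
        # broadcast addition completes every location with global context
        return x + self.channel_mlp(z)
```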
3.3. Semantic Branch
The semantic branch is proposed because the features of missed instances are also present within the RoI features. The semantic branch generates semantic features, providing semantic information for missed instances and enhancing the differentiation between these instances and the background. We also include an additional head to supervise the semantic features so that the semantic information is explicit. Finally, the RoI features carrying semantic information are combined with the original RoI features carrying global information to obtain more comprehensive instance features.
The following formula expresses the specific process:

$$x_t = \mathcal{A}(x_{sem}, b_{t-1}) + \mathcal{A}(x_{GPN}, b_{t-1}),$$

where $\mathcal{A}$ denotes RoIAlign, $x_{sem}$ is the semantic feature map generated by the semantic branch, and the element-wise sum yields RoI features that carry both semantic and global information.
Figure 4a illustrates the particular structure of our semantic branch. In the first step, we use the G2 layer as input because this layer contains more detailed spatial information while already integrating high-level semantic information, resulting in richer information overall. After three 3 × 3 convolutions, the Pyramid Pooling Module (PPM) [37] is introduced in order to obtain semantic features at varying scales. Figure 4b portrays the PPM module. Specifically, the feature layer is divided into 6 × 6, 3 × 3, 2 × 2, and 1 × 1 grids; each grid is average-pooled, and the pooled results are aggregated by up-sampling. A 1 × 1 convolution is then used to adjust the channel dimension to match that of the original RoI features, and a final convolution produces the semantic predictions. To ensure the quality of the semantic features generated by the branch, we apply a cross-entropy loss between the predictions and the semantic segmentation labels.
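For concreteness, a PSPNet-style PPM with the grid sizes named above might be sketched as follows; the per-level channel split and the projection width are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Sketch of the Pyramid Pooling Module used in the semantic branch.
    Assumes in_ch is divisible by the number of pyramid levels."""
    def __init__(self, in_ch: int, out_ch: int, grids=(6, 3, 2, 1)):
        super().__init__()
        self.grids = grids
        # one 1x1 conv per pyramid level to compress the pooled features
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, in_ch // len(grids), kernel_size=1) for _ in grids
        )
        # final 1x1 conv matches the channel dimension of the RoI features
        self.project = nn.Conv2d(in_ch * 2, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        pyramid = [x]
        for g, conv in zip(self.grids, self.convs):
            p = conv(F.adaptive_avg_pool2d(x, g))  # average-pool into a g x g grid
            # up-sample back to the input resolution before aggregation
            pyramid.append(F.interpolate(p, size=(h, w), mode="bilinear",
                                         align_corners=False))
        return self.project(torch.cat(pyramid, dim=1))
```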
The second step performs feature fusion. We use the bounding boxes to align the semantic features and obtain the RoI semantic features. In the same way, we align the original GPN features to obtain the RoI global features, and the two are added element-wise to obtain more comprehensive RoI features for the subsequent dynamic interaction and regression.
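The fusion step itself can be sketched with torchvision's roi_align, assuming boxes are given as per-image tensors in image coordinates and using an illustrative spatial scale:

```python
import torch
from torchvision.ops import roi_align

def fuse_roi_features(sem_map, gpn_map, boxes, out_size=7, scale=0.25):
    """Align semantic and global feature maps with the same boxes, then add."""
    roi_sem = roi_align(sem_map, boxes, out_size, spatial_scale=scale)   # RoI semantic features
    roi_glob = roi_align(gpn_map, boxes, out_size, spatial_scale=scale)  # RoI global features
    return roi_sem + roi_glob  # element-wise addition -> comprehensive RoI features
```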
3.4. OTA-Query
QueryInst’s sample allocation strategy assigns one positive sample to each label. QueryInst treats this one-to-one optimal allocation problem as bipartite matching and employs the Hungarian algorithm to solve it. However, assigning only one positive sample per label is suboptimal for addressing the missed detection problem; it is more suitable to assign multiple positive samples per label. As depicted in Figure 5, under the one-to-one sample allocation, the old man in the back is missed (as also depicted in Figure 1), indicating that a single positive sample for regression is insufficient to improve the feature quality of missed instances. Therefore, we present OTA-Query to match missed instances with multiple positive samples and to enhance the network’s extraction of missed instance features through the increased number of positive samples.
The OTA-Query sample allocation strategy is designed as follows. Specifically, each label is considered a supplier that provides a specific number of assignments, and each sample is considered a demander that seeks one label. In addition, the background class is considered a special supplier, as it supplies the “background label.” Suppose there are m labels, and each label can provide k copies of itself to assign to positive samples in an image. The image contains n boxes, each of which will receive a label. Positive samples are those that successfully match a label, while the remaining n − k × m samples are assigned to the background class and become negative samples. Consequently, the following target formula is presented:

$$\min_{\pi} \sum_{i=1}^{m+1} \sum_{j=1}^{n} c_{ij}\,\pi_{ij} \quad \text{s.t.} \quad \sum_{j=1}^{n} \pi_{ij} = s_i, \quad \sum_{i=1}^{m+1} \pi_{ij} = 1, \quad \pi_{ij} \geq 0,$$

where $s_i = k$ for the m labels and $s_{m+1} = n − k \times m$ for the background supplier.
The optimal transport problem involves the determination of the Wasserstein distance, which is defined as the minimal total cost incurred in transferring each label to its corresponding sample. Here, $i$ represents the label index and $j$ represents the sample index; $\pi_{ij}$ indicates the amount of the $i$-th label supplied to the $j$-th sample, and $c_{ij}$ represents the cost of supplying the $i$-th label to the $j$-th sample. The specific cost calculation method is as follows:

$$c_{ij} = \alpha L_{cls}(i, j) + \beta L_{L1}(i, j) + \gamma L_{giou}(i, j)$$
The cost is the weighted sum of the classification loss, regression loss, and GIoU loss when the sample is a positive sample. The weight coefficients α, β, and γ follow the setting of the loss function and are 2, 5, and 2, respectively. If the sample is negative, the cost is restricted to the classification loss. To determine the k value, we calculate the IoU between the samples and each label, select the top 10 IoU values for each label, sum them, and round up. To obtain the optimal matching result for the target formula in Equation (6), the Sinkhorn algorithm is used for iterative calculation.
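Assuming the per-pair losses are precomputed and the background forms one extra supplier row, an illustrative Sinkhorn-based implementation of OTA-Query might look as follows; the function name, the regularization strength eps, and the iteration count are our own choices:

```python
import torch

def ota_query_assign(cls_cost, l1_cost, giou_cost, bg_cls_cost, ious,
                     alpha=2.0, beta=5.0, gamma=2.0, eps=0.1, iters=50):
    """Sketch of OTA-Query: m labels supply k units each, the background
    supplies the rest, and Sinkhorn solves for the transport plan."""
    m, n = cls_cost.shape
    # dynamic k per label: sum of its top-10 IoUs with the samples, rounded up
    k = ious.topk(min(10, n), dim=1).values.sum(1).ceil().clamp(min=1)
    # cost matrix: foreground rows are the weighted loss sum; the extra
    # background row carries the classification loss only
    fg = alpha * cls_cost + beta * l1_cost + gamma * giou_cost          # (m, n)
    cost = torch.cat([fg, bg_cls_cost.unsqueeze(0)], dim=0)             # (m+1, n)
    supply = torch.cat([k, (n - k.sum()).clamp(min=0).unsqueeze(0)])    # (m+1,)
    demand = torch.ones(n)                                              # one label per sample
    # entropic-regularized optimal transport solved by Sinkhorn iterations
    K = torch.exp(-cost / eps)
    u = torch.ones(m + 1)
    for _ in range(iters):
        v = demand / (K.t() @ u + 1e-8)
        u = supply / (K @ v + 1e-8)
    pi = u.unsqueeze(1) * K * v.unsqueeze(0)                            # transport plan
    assigned = pi.argmax(dim=0)   # index m means "background" (negative sample)
    return assigned, pi
```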
3.5. Parallel and Non-Parallel
Our semantic branch and OTA-Query are parallel. All of their computation occurs within a stage, which is compatible with the parallel supervision mechanism of QueryInst and thereby enhances the extraction of missed instance features. In each stage, the RoI features with semantic information extracted by the semantic branch are fused with the original RoI features, and OTA-Query performs one-to-many sample allocation. The stages of the above structures do not interfere with one another. To demonstrate the superiority and efficacy of this parallel design, we propose non-parallel structures, in which different stages influence and interact with one another, for comparison. As depicted in Figure 6, we made four modifications to the QueryInst algorithm to realize serial interaction of the mask branch.
The overall logical framework is depicted in Figure 6. The four structures realize serial interaction across stages. In the first structure, the four-convolution sequence at the same location in the previous stage is fully utilized before the convolution of the current stage. This is described with the formula:

$$m_t = \mathcal{M}_t(\mathcal{M}_{t-1}(x_t^{mask*})),$$

where $x_t^{mask*}$ represents the enhanced mask feature. $x_t^{mask*}$ is input successively to the mask heads of the $t-1$ and $t$ stages, and finally the segmentation result $m_t$ is obtained.
The second structure makes extensive use of the dynamic mask interactive module of the previous stage. This is described with the formula:

$$x_t^{mask*} = DynConv_t^{mask}(DynConv_{t-1}^{mask}(x_t^{mask}, q_{t-1}), q_t),$$

where $x_t^{mask}$ represents the mask feature and $q$ represents the enhanced query feature. $x_t^{mask}$ and the queries are input successively to the dynamic heads of the $t-1$ and $t$ stages, and finally $x_t^{mask*}$ is obtained.
The third structure is the fusion of the enhanced mask features between stages. This is described with the formula:

$$\tilde{x}_t^{mask*} = x_{t-1}^{mask*} + x_t^{mask*},$$

where $x_{t-1}^{mask*}$ and $x_t^{mask*}$ represent the enhanced mask features of the $t-1$ and $t$ stages, respectively. Their sum results in the new $\tilde{x}_t^{mask*}$.
The fourth structure unifies the first three non-parallel structures and completes the serial interaction of the mask branches. The formula is expressed as:

$$m_t = \mathcal{M}_t(DynConv_t^{mask}(\mathcal{M}_{t-1}(DynConv_{t-1}^{mask}(x_t^{mask}, q_{t-1})), q_t)).$$

That is, the mask feature is first input to the $t-1$ stage modules to obtain an intermediate result, which is then input to the $t$ stage modules to obtain the final segmentation result $m_t$.
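To illustrate how this fourth structure couples adjacent stages, here is a toy PyTorch sketch with stand-in modules; StageStub, the nn.Bilinear proxy for dynamic convolution, and the linear mask head are purely illustrative and do not reflect the real layers:

```python
import torch
import torch.nn as nn

class StageStub(nn.Module):
    """Toy stand-ins for one stage's dynamic convolution and mask head."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.dyn = nn.Bilinear(dim, dim, dim)  # proxy for DynConv(feature, query)
        self.mask_head = nn.Linear(dim, dim)   # proxy for the four-conv mask head

def serial_mask_forward(x_mask, q_prev, q_cur, prev: StageStub, cur: StageStub):
    # pass the features through the (t-1)-stage modules first ...
    x = prev.mask_head(prev.dyn(x_mask, q_prev))
    # ... then through the t-stage modules, coupling the two stages' gradients,
    # which is exactly what breaks the parallel supervision mechanism
    return cur.mask_head(cur.dyn(x, q_cur))
```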
3.6. Loss Function
To supervise the semantic branch, we added an additional loss function on top of the baseline. Consequently, the overall loss function comprises the following components:

$$L = \lambda_{cls} L_{cls} + \lambda_{L1} L_{L1} + \lambda_{giou} L_{giou} + \beta L_{mask} + \gamma L_{sem}$$
For the detection branch, we adhered to the hyperparameter settings of Sparse R-CNN [38], where $\lambda_{cls}$, $\lambda_{L1}$, and $\lambda_{giou}$ are 2, 5, and 2, respectively. We adopted Focal Loss [18] as the category loss function $L_{cls}$, and L1 Loss and GIoU Loss are utilized as the bounding box loss functions $L_{L1}$ and $L_{giou}$, respectively. For the segmentation branch, we followed the hyperparameter settings of QueryInst [18], where $\beta$ is 8 and $L_{mask}$ adopts Dice Loss. For supervision of the semantic branch, we used the cross-entropy loss function with the following formula, where $s$ is the semantic segmentation result and $s^{*}$ is the label:

$$L_{sem} = -\frac{1}{N} \sum_{p=1}^{N} s_p^{*} \log(s_p)$$

The ablation experiment section discusses the selection of the $\gamma$ parameter.
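Putting the quoted settings together, a minimal sketch of the overall loss might read as follows; the per-stage loss terms are assumed precomputed, and the default γ is a placeholder pending the ablation study:

```python
def total_loss(l_cls, l_l1, l_giou, l_dice, l_sem,
               lam_cls=2.0, lam_l1=5.0, lam_giou=2.0, beta=8.0, gamma=1.0):
    """Weighted sum of detection, segmentation, and semantic-branch losses."""
    det = lam_cls * l_cls + lam_l1 * l_l1 + lam_giou * l_giou  # detection branch
    seg = beta * l_dice                                        # segmentation branch (Dice)
    sem = gamma * l_sem                                        # semantic branch (cross-entropy)
    return det + seg + sem
```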