2.3.1. Heat-Map-Guided Query Initialization
In object detection tasks, heat maps can help identify the locations and regions of objects, thus indicating which regions the model believes contain specific objects. Therefore, we used heat maps to provide prior information and to initialize the query content embedding. During the first stage of training, we used radar to generate the heat maps. The procedure for generating heat maps from the radar BEV features is as follows. Given the radar feature map $F \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the channel number, height, and width of the radar feature map, respectively, a fully connected layer is first utilized to transform the channel number of the radar feature map into the number of classes that the network needs to predict. Subsequently, each value in the resulting map is normalized to the range of 0 to 1. Next, a tensor $T$ of the same size as the radar feature map is established to acquire local maxima, following the rule that each retained value must be greater than the surrounding 8 values. The values of $T$ are then compared to the values of the normalized map one by one; where they are equal, the value is retained as a local maximum, and the top-$K$ maxima are selected in descending order. Finally, the two-dimensional feature map is flattened into a one-dimensional vector, yielding the index of each maximum on that vector. Using these position indices on the flattened original radar BEV feature map, the corresponding values of the original radar BEV feature map are gathered. At this point, a heat map $S \in \mathbb{R}^{C \times K}$ generated from the radar BEV feature map is given, and, through transposition, the query $Q = S^{\top} \in \mathbb{R}^{K \times M}$ initialized by the heat map is obtained, where $M$ represents the dimension of each query and is equivalent to the size of $C$. An additional output head is added at the output of each decoding layer for predicting the heat map $\hat{S}$.
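As a concrete illustration of this procedure, the following is a minimal PyTorch sketch of heat-map-guided query initialization. The module name, the use of a 1×1 convolution as the per-pixel fully connected layer, and the `num_classes`/`top_k` values are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapQueryInit(nn.Module):
    def __init__(self, channels: int, num_classes: int, top_k: int = 200):
        super().__init__()
        # 1x1 conv acts as the per-pixel fully connected layer that maps
        # radar channels to class scores.
        self.class_head = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.top_k = top_k

    def forward(self, radar_feat: torch.Tensor):
        B, C, H, W = radar_feat.shape
        # Class heat map, normalized to (0, 1) with a sigmoid.
        heatmap = self.class_head(radar_feat).sigmoid()           # (B, cls, H, W)
        # Local maxima: a value survives only if it equals the max of its
        # 3x3 neighborhood, i.e., it is >= its surrounding 8 values.
        pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
        peaks = heatmap * (pooled == heatmap)                     # suppress non-peaks
        # Flatten class and spatial dims, then take the top-K peak scores.
        scores, indices = peaks.flatten(1).topk(self.top_k, dim=1)
        # Recover spatial indices on the flattened (H*W) grid.
        spatial_idx = indices % (H * W)                           # (B, K)
        # Gather the original radar BEV features at the peak locations to
        # initialize the query content embedding (B, K, C).
        flat_feat = radar_feat.flatten(2).transpose(1, 2)         # (B, H*W, C)
        queries = flat_feat.gather(1, spatial_idx.unsqueeze(-1).expand(-1, -1, C))
        return queries, scores
```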
For the generation of the ground truth of the heat map, the number of objects is first obtained from the ground-truth labels, and the height $h$ and width $w$ of each object's box are obtained from its ground-truth label; these are used to compute the radius $r$. The radius mainly depends on the overlap between the ground-truth box and the predicted box, and its value is taken according to the different critical overlap situations.
According to the form of the roots of a quadratic equation, three candidate radii $r_1$, $r_2$, and $r_3$ can be obtained by
$$ r_i = \frac{b_i \pm \sqrt{b_i^2 - 4 a_i c_i}}{2 a_i}, \quad i \in \{1, 2, 3\}, $$
where the coefficients $a_i$, $b_i$, and $c_i$ are determined by the three critical overlap cases between the ground-truth box and the predicted box. Taking the minimum of $r_1$, $r_2$, and $r_3$ gives the required radius $r$.
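The three critical cases match the widely used Gaussian-radius routine from CornerNet/CenterNet, which the text appears to follow; a sketch under that assumption is given below, with `min_overlap` as the assumed IoU threshold.

```python
import math

def gaussian_radius(height: float, width: float, min_overlap: float = 0.7) -> float:
    # Case 1: both corners of the predicted box lie inside the GT box.
    a1 = 1.0
    b1 = height + width
    c1 = width * height * (1 - min_overlap) / (1 + min_overlap)
    r1 = (b1 - math.sqrt(b1 ** 2 - 4 * a1 * c1)) / (2 * a1)

    # Case 2: both corners lie outside the GT box.
    a2 = 4.0
    b2 = 2 * (height + width)
    c2 = (1 - min_overlap) * width * height
    r2 = (b2 - math.sqrt(b2 ** 2 - 4 * a2 * c2)) / (2 * a2)

    # Case 3: one corner inside the GT box, one outside.
    a3 = 4.0 * min_overlap
    b3 = -2 * min_overlap * (height + width)
    c3 = (min_overlap - 1) * width * height
    r3 = (b3 + math.sqrt(b3 ** 2 - 4 * a3 * c3)) / (2 * a3)

    # The minimum of the three candidate radii is the required radius r.
    return min(r1, r2, r3)
```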
Next, the obtained radius $r$ is used to establish a rectangular grid of coordinates over $[-r, r] \times [-r, r]$. The Gaussian function is applied to each coordinate to obtain a Gaussian distribution over the rectangle, so that the values spreading out in all directions from the object center decay with distance, penalizing long-range predictions. Then, the center $(x, y)$ of each object is obtained from the ground-truth label, the extent that each center can cover on the top, bottom, left, and right is determined from the width, height, and radius of the heat map, and the Gaussian-distributed rectangle is cropped onto each object center. This completes the generation of the ground truth of the heat map.
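A minimal sketch of this ground-truth drawing step, assuming the common CenterNet-style implementation; the function name and the `sigma = diameter / 6` choice are assumptions.

```python
import numpy as np

def draw_gaussian(heatmap: np.ndarray, center: tuple, radius: int):
    # Gaussian kernel on a (2r+1) x (2r+1) grid so that values decay
    # with distance from the object center.
    diameter = 2 * radius + 1
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    sigma = diameter / 6.0
    gaussian = np.exp(-(x * x + y * y) / (2 * sigma * sigma))

    cx, cy = center  # integer center coordinates on the heat map
    h, w = heatmap.shape
    # Extent the kernel can cover without leaving the map (crop at borders).
    left, right = min(cx, radius), min(w - cx, radius + 1)
    top, bottom = min(cy, radius), min(h - cy, radius + 1)

    masked_map = heatmap[cy - top:cy + bottom, cx - left:cx + right]
    masked_gauss = gaussian[radius - top:radius + bottom, radius - left:radius + right]
    # Keep the element-wise maximum so overlapping objects do not erase
    # each other's peaks.
    np.maximum(masked_map, masked_gauss, out=masked_map)
```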
2.3.2. Dynamic Position Encoding
Positional encoding is a crucial concept in image tasks, particularly within the Transformer architecture. Due to its self-attention mechanism, the Transformer inherently lacks order perception of the elements in a sequence. Consequently, without positional encoding, the Transformer cannot distinguish the relative positions of the elements. Positional encoding enhances the model's expressive power by providing additional information, thus enabling better sequence understanding and generation. Moreover, positional encoding is learnable, allowing the model to adapt during training to better suit specific tasks. This section introduces improvements to the traditional positional encoding in the Transformer architecture for the task of 3D object detection. Unlike traditional positional encoding, which uses random parameters to learn positional information from the query, the dynamic positional encoding proposed here leverages positional information connected with the predicted values $(x, y, z)$ of the outputs from each layer. This approach provides more accurate positional information related to the objects.
A comparison with the structure of conventional position encoding is illustrated in Figure 5. For the reference point $P_{ref}$, which is used in detection models as a crucial medium for the interaction of different modalities, the previous algorithm employed randomly initialized position encoding information. In contrast, the proposed dynamic position encoding algorithm first initializes the value of the reference point, then utilizes this reference point to generate the position encoding, and subsequently uses the output of each layer to continuously optimize the position of the reference point. This process ensures that the position information of the reference point more closely aligns with the position of the real object. The position encoding generated from the reference point provides a positional prior for the subsequent decoding of the query, enabling the query to begin its search from a position relatively close to the actual object and thus to locate the real object more accurately. This approach results in faster convergence and improved performance compared to the traditional position encoding method, as it reduces the solution space and thereby lessens the breadth and complexity of the neural network's parameter learning.
Then, the obtained reference point $P_{ref}$ first passes through a sinusoidal encoding module:
$$ P_{sin} = \mathrm{SE}(P_{ref}) \in \mathbb{R}^{3M/2}, $$
where $M$ is the dimension of the position encoding and $\mathrm{SE}$ denotes sinusoidal encoding. The encoded reference point $P_{sin}$ is then passed through a fully connected layer $F$ with parameters $(3M/2, M)$ to obtain the position embedding $P_{emb} = F(P_{sin})$. This algorithm also sets a scale function $\mathrm{Scale}$, an MLP (multilayer perceptron) with parameters $(M, 2M, M)$, for injecting object scale information into the position encoding; the input of the scale function is 1 for the first of the $L$ decoding layers and the output $O$ of the previous layer for the other layers. The final position encoding is the product
$$ \mathrm{Pos} = \mathrm{Scale}(O) \cdot P_{emb}. $$
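The following PyTorch sketch illustrates the dynamic position encoding under the definitions above; the sinusoidal frequency base of 10000 and the module names are assumptions.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encode(ref_points: torch.Tensor, num_feats: int) -> torch.Tensor:
    # ref_points: (B, K, 3) reference points (x, y, z), normalized to [0, 1].
    # Each coordinate is expanded to `num_feats` dimensions, giving a
    # 3 * num_feats = 3M/2 vector when num_feats = M/2 (num_feats even).
    dim_t = torch.arange(num_feats, device=ref_points.device)
    dim_t = 10000 ** (2 * (dim_t // 2) / num_feats)
    pos = ref_points.unsqueeze(-1) * 2 * math.pi / dim_t       # (B, K, 3, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(-3)                                     # (B, K, 3 * num_feats)

class DynamicPosEncoding(nn.Module):
    def __init__(self, m: int):
        super().__init__()
        self.m = m
        self.fc = nn.Linear(3 * m // 2, m)         # F with parameters (3M/2, M)
        self.scale = nn.Sequential(                # Scale MLP with parameters (M, 2M, M)
            nn.Linear(m, 2 * m), nn.ReLU(), nn.Linear(2 * m, m))

    def forward(self, ref_points, prev_output=None):
        pos = self.fc(sinusoidal_encode(ref_points, self.m // 2))
        # First decoding layer: the scale factor is 1; later layers modulate
        # the encoding with the previous layer's output.
        if prev_output is None:
            return pos
        return self.scale(prev_output) * pos
```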
2.3.3. Auxiliary Noise Query
The decoder addresses object detection as a set prediction problem, aiming to achieve an optimal matching between each predicted box and the ground-truth box that minimizes the overall cost. The instability of the cost matrix results in unstable matching between the queries and the ground truth, which frequently disrupts the learning process of the queries. To alleviate this problem, this paper introduces an auxiliary noise query module that uses ground-truth information to stabilize the matching process and improve the learning conditions of the query.
In this paper, the query is divided into two parts. The first part, the matching component, follows the same processing method as the query in the previous model: it uses the Hungarian algorithm to match the decoder outputs to the ground-truth labels and learns to approximate the matched labels. The second part is the denoising component. The input for this component is the noised ground truth, and the output aims to reconstruct the real 3D box. Since the added noise is relatively small, it is easier for the model to predict the corresponding GT from these noisy inputs, thereby reducing the learning difficulty. Additionally, because the learning goal is clear, the input derived from a specific noised GT is responsible for predicting its corresponding ground truth, thus avoiding the matching ambiguity introduced by Hungarian bipartite matching.
The construction of the denoising query consists of two main components: label noise and 3D box noise. Additionally, this algorithm applies various noise addition schemes to the ground truth (GT), as illustrated in Figure 6. Specifically, assuming a batch of data contains $d$ real GT values repeated $n$ times to construct $n$ groups of different denoising queries, the total number of denoising queries is $n \times d$, which is denoted by $D$. For label noise addition, the algorithm first generates $D$ probability values conforming to a normal distribution. It then uses a preset probability $p$ to screen the $k$ indices of the $D$ probability values that are smaller than $p$. Next, $k$ positive integers representing random labels within the range of the number of categories are generated. These $k$ random labels are then placed according to the indices, modifying the values at the corresponding positions of the $n$-times-repeated real labels to the noised labels. Finally, a fully connected layer encodes this noised label sequence to obtain the noise query $Q_{noise}$.
Finally, the original content query and the noise query are concatenated along the number dimension to obtain the hybrid content query $Q_{hybrid}$.
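A sketch of the label noise construction follows. Where the text leaves details open, this sketch makes assumptions: a uniform random draw screened against $p$ (as in DN-DETR) in place of the normal draw, and an embedding layer acting as the fully connected encoder.

```python
import torch
import torch.nn as nn

def make_label_noise_queries(gt_labels: torch.Tensor, num_groups: int,
                             num_classes: int, p: float,
                             label_embed: nn.Embedding) -> torch.Tensor:
    # gt_labels: (d,) ground-truth class indices for one sample.
    # Repeat the d GT labels n times -> D = n * d denoising labels.
    noised = gt_labels.repeat(num_groups).clone()              # (D,)
    # Draw D random values and flip the labels whose draw falls below p.
    flip = torch.rand_like(noised, dtype=torch.float) < p      # (D,) bool
    random_labels = torch.randint(0, num_classes, noised.shape,
                                  device=noised.device)
    noised[flip] = random_labels[flip]
    # A learnable embedding maps the noised labels to the noise
    # content queries, one M-dimensional query per noised GT.
    return label_embed(noised)                                 # (D, M)
```

These noise queries are then concatenated with the matching queries along the number dimension, as described above.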
For 3D box noise addition, a box contains six parameters $(x, y, z, w, l, h)$. The main operations of this algorithm are center displacement and scale scaling; to ensure a small perturbation, the 3D box parameters are first normalized to $[0, 1]$. For center displacement, a perturbation parameter $\epsilon \sim U(-1, 1)$ is first sampled from a uniform distribution, and then the offset corresponding to the center point $(x, y, z)$ is calculated as follows:
$$ \Delta x = \epsilon \cdot \frac{\lambda_1 w}{2}, \quad \Delta y = \epsilon \cdot \frac{\lambda_1 l}{2}, \quad \Delta z = \epsilon \cdot \frac{\lambda_1 h}{2}, $$
and this constrains
$$ |\Delta x| < \frac{\lambda_1 w}{2}, \quad |\Delta y| < \frac{\lambda_1 l}{2}, \quad |\Delta z| < \frac{\lambda_1 h}{2}, $$
so that the noised center still lies inside the original box when $\lambda_1 \le 1$. Similarly, a perturbation parameter is sampled from the uniform distribution, and the offsets corresponding to $(w, l, h)$ are computed separately:
$$ w' = \alpha w, \quad l' = \beta l, \quad h' = \gamma h, \quad \alpha, \beta, \gamma \in [1 - \lambda_2, 1 + \lambda_2]. $$
The length, width, and height of the final 3D box are thus scaled within the interval $[0, 2]$ relative to the original size. According to the dynamic position encoding algorithm above, the noised 3D box serves as the reference point, which is used to generate the position encoding, while the subsequent query of the decoder is the superposition of the content query and the position encoding.
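The 3D box noise can be sketched as follows, assuming boxes normalized to $[0, 1]$ and the $(x, y, z, w, l, h)$ parameterization above; the sampling details are assumptions consistent with the stated constraints.

```python
import torch

def add_box_noise(boxes: torch.Tensor, lambda1: float, lambda2: float):
    # boxes: (D, 6) = (x, y, z, w, l, h), all normalized to [0, 1].
    centers, sizes = boxes[:, :3], boxes[:, 3:]
    # Center displacement: |delta| < lambda1 * size / 2, so the noised
    # center stays inside the original box when lambda1 <= 1.
    eps = torch.rand_like(centers) * 2 - 1                 # uniform in (-1, 1)
    noised_centers = centers + eps * lambda1 * sizes / 2
    # Scale scaling: each size is multiplied by a factor drawn from
    # [1 - lambda2, 1 + lambda2], i.e., within [0, 2] when lambda2 <= 1.
    scale = 1 + (torch.rand_like(sizes) * 2 - 1) * lambda2
    noised_sizes = sizes * scale
    return torch.cat([noised_centers.clamp(0, 1), noised_sizes], dim=1)
```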
After completing the design of the noise query, the current query becomes the hybrid query $Q_{hybrid}$; this query is subsequently used in the same way as in the standard DETR model. In the decoder, however, the hybrid query passes through a self-attention module that performs global interaction. The queries used for matching could therefore fetch the content of the noise queries, which leads to information leakage, since the noise queries are derived from the ground-truth GT information. Therefore, this section designs a mask module that serves two purposes: firstly, it prevents the queries used for matching from interacting with the noise queries; secondly, it prevents the noise queries from interacting with those of different groups. This ensures that the matching task and the denoising task are independent of each other and do not interfere with each other. Whether the denoising part can see the matching part does not affect the performance, because the queries in the matching part are learned queries and do not contain information about the GT objects.
This algorithm uses a matrix $A \in \mathbb{R}^{(K + n d) \times (K + n d)}$ to represent the self-attention mask, where $K$ is the number of queries used for matching, $d$ is the number of ground-truth GTs corresponding to this set of queries, and $n$ represents the number of denoising groups. If $A_{ij} = 0$, the $i$th query can interact with the $j$th query; if $A_{ij} = 1$, the $i$th query cannot interact with the $j$th query. As shown in Figure 4, the first $K$ rows and $K$ columns of the self-attention mask represent the matching part, and the subsequent $d$ rows and $d$ columns represent one denoising group, of which there are $n$, and so on. The final noisy query predictions yield categories and 3D box sizes that still require the computation of the loss $L_{noise}$:
$$ L_{noise} = \lambda \cdot \mathcal{L}_{smooth}(P, G), $$
where $\lambda$ is used to control the weight of the auxiliary noise loss, $P$ represents the predicted values, $G$ represents the ground-truth values, and $\mathcal{L}_{smooth}$ represents the Smooth L1 loss. Since the auxiliary noise is added to the ground truth, the values of $P$ and $G$ correspond one-to-one, without the need to perform positive–negative sample matching; this is theoretically equivalent to increasing the proportion of positive samples, which is the intrinsic reason for the effectiveness of the method.
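A sketch of the self-attention mask construction described above, assuming the DN-DETR convention that a True (i.e., 1) entry blocks interaction between query $i$ and query $j$; the function and parameter names are illustrative.

```python
import torch

def make_denoise_mask(num_match: int, num_gt: int, num_groups: int) -> torch.Tensor:
    # Total queries: K matching queries followed by n groups of d noise queries.
    total = num_match + num_groups * num_gt
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Matching queries must not see any noise query, since noise queries
    # carry (noised) ground-truth information.
    mask[:num_match, num_match:] = True
    for g in range(num_groups):
        start = num_match + g * num_gt
        end = start + num_gt
        # Each denoising group must not see the other denoising groups;
        # seeing the matching part is harmless, as noted in the text.
        mask[start:end, num_match:start] = True
        mask[start:end, end:] = True
    return mask
```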