AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection

Zongdai Liu, Dingfu Zhou *, Feixiang Lu, Jin Fang and Liangjun Zhang
Robotics and Autonomous Driving Laboratory, Baidu Research,
National Engineering Laboratory of Deep Learning Technology and Application, China
{liuzongdai, zhoudingfu, lufeixiang, fangjin, liangjunzhang}@baidu.com
arXiv:2108.11127v1 [cs.CV] 25 Aug 2021

* Corresponding author

Abstract

Existing deep learning-based approaches for monocular 3D object detection in autonomous driving often model the object as a rotated 3D cuboid while the object's geometric shape is ignored. In this work, we propose an approach for incorporating shape-aware 2D/3D constraints into the 3D detection framework. Specifically, we first employ a deep neural network to learn distinguished 2D keypoints in the 2D image domain and regress their corresponding 3D coordinates in the local 3D object coordinate frame. The 2D/3D geometric constraints built from these correspondences are then used to boost the detection performance for each object. To generate the ground truth of the 2D/3D keypoints, an automatic model-fitting approach is proposed that fits a deformed 3D object model to the object mask in the 2D image. The proposed framework has been verified on the public KITTI dataset, and the experimental results demonstrate that the additional geometric constraints significantly improve the detection performance compared with the baseline method. More importantly, the proposed framework achieves state-of-the-art performance in real time. Data and code will be available at https://github.com/zongdai/AutoShape

Figure 1: (a) 3D Bbox corners and the center point are commonly used for monocular 3D object detection; however, the rich structure information from 3D vehicle shapes and their projection onto 2D images is not employed. (b) Our shape-aware constraints are constructed from an aligned 3D model. Such 2D-3D keypoints carry more semantic and geometric information and enable stronger geometric constraints for monocular 3D detection.

1. Introduction

Perceiving the 3D shapes and poses of surrounding obstacles is an essential task in autonomous driving (AD) perception systems. The accuracy and speed of 3D object detection are important for the downstream motion planning and control modules in AD. Many 3D object detectors [50, 14] have been proposed, mainly for depth sensors such as LiDAR [35, 45] or stereo cameras [46, 18], which provide distance information of the environment directly. However, LiDAR sensors are expensive and stereo rigs suffer from on-line calibration issues. Therefore, monocular camera based 3D object detection has become a promising direction.

The main challenge for monocular-based approaches is to obtain accurate depth information. In general, depth estimation from a single image without any prior information is a challenging problem, and recently many deep learning-based approaches have achieved good results [7]. With the estimated depth map, a pseudo-LiDAR point cloud can be reconstructed via the pre-calibrated intrinsic camera parameters, and 3D detectors designed for LiDAR point clouds can be applied directly to the pseudo-LiDAR point cloud [33] [37]. Furthermore, [33] integrates the depth estimation and 3D object detection networks in an end-to-end manner. However, the heavy computation burden is one main bottleneck of such two-stage approaches.
To improve the efficiency, many direct regression-based approaches have been proposed (e.g., SMOKE [24], RTM3D [19]) and have achieved promising results. By representing the object as one center point, the object detection task is formulated as keypoint detection plus regression of the corresponding attributes (e.g., size, offsets, orientation, depth, etc.). With this compact representation, the computation speed of this kind of approach can reach 20~30 fps (frames per second). However, the drawback is also obvious: a single center point representation [48, 24] ignores the detailed shape of the object and causes location ambiguity if the projected center point falls on another object's surface due to occlusion [47]. To alleviate this ambiguity, other geometric constraints have been used to improve the performance. RTM3D [19] adds 8 more keypoints as additional constraints, defined as the projected 2D locations of the 3D bounding box's corners. However, these keypoints have no real physical meaning in the image, and their 2D locations vary strongly with the camera viewpoint and even with the object's orientation. As shown on the left of Fig. 1, some keypoints lie on the ground and some on the sky or trees, which makes it extremely difficult for the keypoint detection network to distinguish these keypoints from other image pixels.

In this paper, we propose a novel approach that learns meaningful keypoints on the object surface and then uses them as additional geometric constraints for 3D object detection. Specifically, we first design an automatic deformable model-fitting pipeline to generate the 2D/3D correspondences for each object. Then, the center point plus several distinguished keypoints are learned by a deep neural network. Based on these keypoints and other regressed object attributes (e.g., orientation angle, object dimension, etc.), the object's 3D bounding box can be solved with linear equations. The proposed framework can be trained in an end-to-end manner. Our contributions include:

1. We propose a shape-aware 3D object detection framework, which employs the geometric constraints of 2D/3D keypoints to boost the detection performance.

2. We present a method for automatically fitting the 3D shape to the visual observations and then generating ground-truth annotations of 2D/3D keypoint pairs for training the network. Our source code and dataset will be made public for the community.

3. The effectiveness of our approach has been verified on the public KITTI dataset, achieving SOTA performance. More importantly, the proposed framework runs in real time (25 fps), so it can be integrated into the AD perception module.

2. Related Work

2.1. Monocular-based 3D Detection

Image-based 3D object detection has become popular due to the low price of camera sensors. Stereo-based approaches usually suffer from calibration issues between the two camera rigs. Therefore, many 3D object detection approaches propose to use a single image frame. Generally, these approaches can be categorized into three types: depth-map-based, direct regression-based, and CAD-model-based methods.

Depth-map-based methods [39] usually need to estimate the depth map first. In [40] and [37], the estimated depth map is transformed into point clouds, and then point-cloud-based 3D object detectors are employed to obtain the detection results. Rather than transforming the depth map into point clouds, many approaches propose using the estimated depth map directly in the framework to enhance 3D object detection. In M3D-RPN [1] and [5], the pre-estimated depth map is used to guide the 2D convolution, which is called "Depth-Aware Convolution". Direct regression-based methods estimate the objects' 3D information directly from the image domain, such as [16, 24, 48, 47]. Direct methods are much more efficient than depth-map-based methods because the depth-map computation procedure is not necessary.

To better exploit prior knowledge, shape information has been integrated into the CAD-model-based approaches. Deep MANTA [2] and ApolloCar3D [36] are two keypoint-based methods, in which the 3D keypoints are pre-defined on the CAD model and their corresponding 2D points on the image plane are computed by a deep neural network. The 3D pose can then be solved with a standard 2D/3D pose solver [15] using these 2D/3D correspondences. Besides keypoint-based methods, dense-matching-based approaches are proposed in [12, 10, 29]. In [12], a Render-and-Compare loss is designed to optimize the 3D pose estimation, while in [10] and [29] the 3D pose estimation and reconstruction of each object are generated simultaneously with the deep neural network.

2.2. Data Labeling for 3D Object Detection

For easy representation, objects are usually described as 3D cuboids in deep learning frameworks while the shape information is totally ignored. Manually labeling the object shape from only the image observation is extremely difficult, and the annotation quality cannot be guaranteed. Many CAD-model-guided annotation approaches have been proposed to obtain dense shape annotations. In [30], both the stereo image and the sparse LiDAR point cloud are employed to generate dense scene flow for both foreground and background pixels. For dynamic objects, 16 vehicle models are chosen as basic templates, and the dense annotation is obtained by finding an optimal 3D similarity transformation (e.g., the pose and scale of the 3D model) from three types of observations: LiDAR points, dense disparity computed by SGM [8], and labeled 2D/3D correspondences.

In [36], 66 keypoints are defined on the 3D CAD models and annotators label their corresponding 2D keypoints on the image.
Based on the 2D/3D correspondences, the object poses can be obtained via a PnP solver. In [42], the authors apply a differentiable shape renderer to signed distance fields (SDF), leveraged together with normalized object coordinate spaces (NOCS), to automatically generate the dense 3D shape without 3D bounding box annotations. Although the whole process is labor-free, the annotation quality is far from the ground truth. Different from [42], we use the ground-truth 3D bounding boxes as strong guidance for our 3D shape annotation generation process.

3. Problem Definition

Before introducing our proposed approach, we first give a general description of the image-based 3D object detection problem.

3.1. Pose Estimation

Given an image, the task of pose estimation is to estimate the orientation and translation of objects in 3D. Specifically, the 6D pose is represented by a rigid transformation (R, T) from the object coordinate system to the camera coordinate system, where R represents the 3D rotation and T represents the 3D translation.

Assuming a 3D object point P_o = (x_o, y_o, z_o) in the object coordinate system, the transformed 3D point P_c = (x_c, y_c, z_c) in the camera coordinate system can be obtained as

    [x_c, y_c, z_c]^T = R [x_o, y_o, z_o]^T + T,    (1)

where R is the rotation matrix and T is the translation vector. Given the camera intrinsic matrix

    K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},

the projected image point p = (u, v) can be obtained as

    s [u, v, 1]^T = K [x_c, y_c, z_c]^T.    (2)

Based on Eq. 1 and Eq. 2, the object pose R and T can theoretically be recovered from the geometric constraints between the 3D points P_o on the object and the projected 2D image points p.
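As a concrete illustration of Eqs. 1 and 2, the following minimal NumPy sketch projects a 3D point from the object frame into the image. The intrinsics, rotation, translation, and keypoint values are hypothetical placeholders chosen only for illustration, not values from the paper.

```python
import numpy as np

def project_object_point(P_o, R, T, K):
    """Project a 3D point from the object frame to pixel coordinates (Eqs. 1-2)."""
    P_c = R @ P_o + T            # Eq. 1: object frame -> camera frame
    uv_s = K @ P_c               # Eq. 2: apply intrinsics, still scaled by s = z_c
    return uv_s[:2] / uv_s[2]    # divide by s to obtain (u, v)

# Hypothetical example values (not from the paper)
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.8],
              [0.0, 0.0, 1.0]])
ry = np.deg2rad(30.0)            # yaw-only rotation, as used later in Eq. 5
R = np.array([[np.cos(ry), 0.0, np.sin(ry)],
              [0.0, 1.0, 0.0],
              [-np.sin(ry), 0.0, np.cos(ry)]])
T = np.array([2.0, 1.5, 15.0])   # object center in the camera frame (meters)
P_o = np.array([0.8, -0.6, 1.9]) # a keypoint in the local object frame

print(project_object_point(P_o, R, T, K))  # -> pixel location (u, v)
```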
3.2. Learning-based 3D Object Detection

In the era of deep learning, many approaches have been proposed to detect objects and directly regress their poses using neural networks, while the geometric 2D/3D constraints are ignored in the formulation. Image-based 3D object detection is a typical such task, which aims at estimating the location and orientation of an object in the camera coordinate system. Usually, an object is represented as a rotated 3D BBox as

    r = (r_x, r_y, r_z);  t = (t_x, t_y, t_z);  d = (l, w, h),    (3)

in which r and t represent the object's orientation and location in the camera coordinate system and d is the dimension of the object. Relying on the strong expressive ability of neural networks, all these parameters are regressed directly without imposing additional constraints. Indeed, 3D object detection and pose estimation are essentially the same problem, and (r, t) can be easily transformed from (R, T). Therefore, we explicitly employ geometric constraints in the pose estimation formulation to improve learning-based 3D object detection.

4. Proposed Method

In this section, we propose a general deep learning-based 3D object detection framework that can employ the 2D/3D geometric constraints. To exploit prior knowledge, CAD models are employed here. First, we pre-define several distinguished 3D keypoints on the CAD models. Then, we build the correlation between these 3D keypoints and their 2D projections on the image with a deep learning network. Finally, the object pose can be solved easily from these geometric constraints. More importantly, all the processes are implemented in the neural network, which can be trained in an end-to-end manner.

4.1. Point-wise 2D-3D Constraints

Assuming a 3D point P_o^i = (x_o^i, y_o^i, z_o^i) in the local object coordinate frame, its projected location (u^i, v^i) on the image plane can be obtained from Eq. 1 and Eq. 2 as

    s [u^i, v^i, 1]^T = K [R | T] [x_o^i, y_o^i, z_o^i, 1]^T.    (4)

In the autonomous driving scenario, the road surface that the object lies on is almost flat locally, so the orientation parameters are reduced from three to one by keeping only the yaw angle r_y around the Y-axis. The rotation matrix R then becomes

    R = \begin{bmatrix} \cos(r_y) & 0 & \sin(r_y) \\ 0 & 1 & 0 \\ -\sin(r_y) & 0 & \cos(r_y) \end{bmatrix}

and Eq. 4 can be simplified as

    \begin{bmatrix} -1 & 0 & \tilde{u}^i \\ 0 & -1 & \tilde{v}^i \end{bmatrix}
    \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix} =
    \begin{bmatrix} x_o^i \cos(r_y) + z_o^i \sin(r_y) + \tilde{u}^i [x_o^i \sin(r_y) - z_o^i \cos(r_y)] \\
                    y_o^i + \tilde{v}^i [x_o^i \sin(r_y) - z_o^i \cos(r_y)] \end{bmatrix},    (5)

where \tilde{u}^i = (u^i - c_x)/f_x, \tilde{v}^i = (v^i - c_y)/f_y, and T = [T_x, T_y, T_z]^T. As described in Eq. 5, each object point provides two constraints. If n points are provided, n × 2 constraints can be obtained as

    A T = B,    (6)

where

    A = \begin{bmatrix} -1 & 0 & \tilde{u}^1 \\ 0 & -1 & \tilde{v}^1 \\ \vdots & \vdots & \vdots \\ -1 & 0 & \tilde{u}^n \\ 0 & -1 & \tilde{v}^n \end{bmatrix}_{2n \times 3},
    B = \begin{bmatrix} x_o^1 \cos(r_y) + z_o^1 \sin(r_y) + \tilde{u}^1 (x_o^1 \sin(r_y) - z_o^1 \cos(r_y)) \\
                        y_o^1 + \tilde{v}^1 (x_o^1 \sin(r_y) - z_o^1 \cos(r_y)) \\
                        \vdots \\
                        x_o^n \cos(r_y) + z_o^n \sin(r_y) + \tilde{u}^n (x_o^n \sin(r_y) - z_o^n \cos(r_y)) \\
                        y_o^n + \tilde{v}^n (x_o^n \sin(r_y) - z_o^n \cos(r_y)) \end{bmatrix}_{2n \times 1}.

However, in the real AD scenario, not all keypoints can be seen from a given camera viewpoint. For keypoints that are seriously occluded, accurate 2D/3D keypoint regression cannot be guaranteed. To handle this kind of uncertainty, we propose to output an additional score that measures the confidence of each keypoint, and this score is used as a weight during the pose calculation. Specifically, for the 2n constraints in Eq. 6, additional weights c = {c_1, c_2, ..., c_{2n}} are added to determine their importance during the pose calculation procedure. Therefore, Eq. 6 can be reformulated as

    diag(c) A T = diag(c) B.    (7)

In this linear system, T represents the object location in the camera coordinate system, which can be solved given the 2D/3D correspondences and the rotation angle r_y. Here, the 3D keypoints are defined in the local object coordinate frame and vary within a relatively small range, and the 2D keypoints are defined in the image domain; both are easy for networks to learn. However, manually labeling the ground truth for the 2D and 3D keypoints is very costly and tedious. Therefore, we develop an auto-labeling pipeline that optimizes the 2D and 3D reprojection errors. The detailed annotation pipeline is introduced in Section 5.
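The weighted system in Eqs. 5-7 is linear in T and can be solved in closed form by least squares. The sketch below assumes the 2D keypoints, 3D keypoints, yaw angle, and confidences have already been produced by the network heads; the variable names are illustrative and not taken from the released code.

```python
import numpy as np

def solve_translation(kpts_2d, kpts_3d, conf, ry, K):
    """Recover T = [Tx, Ty, Tz] from n 2D/3D keypoint pairs (Eqs. 5-7).

    kpts_2d: (n, 2) pixel coordinates (u, v)
    kpts_3d: (n, 3) keypoints in the local object frame (x_o, y_o, z_o)
    conf:    (2n,) per-constraint confidence weights c
    ry:      yaw angle around the camera Y-axis
    K:       (3, 3) camera intrinsic matrix
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u_t = (kpts_2d[:, 0] - cx) / fx          # \tilde{u}^i
    v_t = (kpts_2d[:, 1] - cy) / fy          # \tilde{v}^i
    xo, yo, zo = kpts_3d[:, 0], kpts_3d[:, 1], kpts_3d[:, 2]

    n = kpts_2d.shape[0]
    A = np.zeros((2 * n, 3))
    B = np.zeros(2 * n)
    rot = xo * np.sin(ry) - zo * np.cos(ry)  # shared term in Eq. 5
    A[0::2, 0], A[0::2, 2] = -1.0, u_t       # rows [-1, 0, u~]
    A[1::2, 1], A[1::2, 2] = -1.0, v_t       # rows [0, -1, v~]
    B[0::2] = xo * np.cos(ry) + zo * np.sin(ry) + u_t * rot
    B[1::2] = yo + v_t * rot

    # Eq. 7: weight each constraint by its confidence, then solve least squares
    Aw = conf[:, None] * A
    Bw = conf * B
    T, *_ = np.linalg.lstsq(Aw, Bw, rcond=None)
    return T
```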
Figure 2: Overview of the proposed keypoints-based 3D detection framework. After the backbone network, 8 branch heads follow for center point classification, center point offset, 2D keypoints, 3D coordinates, keypoint confidence, object orientation, dimension, and 3D detection score regression. Finally, all the regressed information is employed to recover the object's 3D Bbox in the camera coordinate system.

4.2. Network

An overview of our proposed framework is illustrated in Fig. 2. We follow a one-stage 3D object detection framework such as CenterNet [48] for its inference efficiency. Our framework is backbone independent, and we employ DLA-34 [41] in our implementation. Given an image I with width W and height H, the output feature map is 4 times smaller than I after passing through the backbone network. To utilize the geometric constraints, the following information needs to be learned by the deep neural network.

Object Center: in anchor-free object detection frameworks, the object center is essential information that serves two functions: deciding whether there is an object, and, if an object exists, where its center is. Usually, these two functions are realized by a classification branch that distinguishes whether a pixel is an object center or not. The output of this branch is W/4 × H/4 × C, where C is the number of classes. Besides the classification, an additional "offset" regression branch is required to compensate for the quantization error introduced during down-sampling. The output of this branch is W/4 × H/4 × 2, representing the offsets in the x and y directions respectively.

Object Dimension: a separate branch is used to regress the object dimension (h, w, l) with an output size of W/4 × H/4 × 3. Similar to other approaches, we do not regress the absolute object size directly but rather a relative scale with respect to the mean object size of each class. Detailed operations can be found in [48].
2D Keypoints: rather than directly detecting these keypoints in the image, we regress n ordered 2D offset coordinates relative to each object center. The benefit is that the number and order of keypoints for each object are well guaranteed. In addition, the regression of the offset is easier after removing the object center. The output size of this branch is W/4 × H/4 × 2n.

3D Keypoints: similar to the 2D keypoints, we regress the 3D keypoints in the local object coordinate frame. In addition, all 3D keypoint values are normalized by the object dimension (l, w, h) in the x, y, and z directions respectively. With this format, the 3D keypoint values lie in a relatively small range, which benefits the whole regression process. The output size of this branch is W/4 × H/4 × 3n.

Object Orientation: similarly, we regress the local orientation angle with respect to the ray through the perspective point of the 3D center, following the Multi-Bin based method [31]. Here, 8 bins are used, with an output size of W/4 × H/4 × 8.

Keypoints Confidence Scores: for each keypoint, a pair of additional confidence scores is regressed to measure its contribution to the linear system for solving the object pose. For the 2n constraints in Eq. 6, a feature map with size W/4 × H/4 × 2n is output.

3D IoU Confidence Score: rather than using the classification score directly as the object detection confidence, we add a dedicated branch to regress the 3D IoU score. This score is supervised by the IoU between the estimated Bbox and the ground-truth Bbox. Finally, the product of this score and the output classification score is assigned as the final 3D detection confidence score.
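Putting the heads together, the per-pixel output layout on the W/4 × H/4 feature map can be summarized as below. The dictionary is a hypothetical configuration written for illustration: the channel counts follow the text, but the names are not from the released code, and the single-channel IoU score is an assumption.

```python
# Hypothetical summary of the regression heads on the W/4 x H/4 feature map.
# n = number of shape-aware keypoints, C = number of object classes.
def head_channels(n, C):
    return {
        "center_heatmap": C,           # object center classification
        "center_offset": 2,            # sub-pixel offset (x, y) after 4x down-sampling
        "dimension": 3,                # relative (h, w, l) w.r.t. the class mean size
        "keypoints_2d": 2 * n,         # ordered 2D offsets from the object center
        "keypoints_3d": 3 * n,         # 3D keypoints, normalized by (l, w, h)
        "orientation": 8,              # Multi-Bin encoding with 8 bins
        "keypoint_confidence": 2 * n,  # one weight per constraint in Eq. 6
        "iou_score": 1,                # 3D IoU confidence (single channel assumed)
    }

print(head_channels(n=16, C=3))
```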

4.3. Loss Function

The overall loss contains the following terms: a center point classification loss l_m and a center point offset regression loss l_off, a 2D keypoints regression loss l_2D, a 3D keypoints regression loss l_3D, an orientation multi-bin loss l_r, a dimension regression loss l_D, a 3D IoU confidence loss l_c, and a 3D bounding box IoU loss l_IoU. Specifically, the multi-task loss is defined as

    L = w_m l_m + w_{off} l_{off} + w_{2D} l_{2D} + w_{3D} l_{3D} + w_r l_r + w_D l_D + w_c l_c + w_{IoU} l_{IoU},    (8)

where l_m is the focal loss as used in [48], l_2D is a depth-guided L1 loss as used in [17], and l_D and l_3D are L1 losses with respect to the ground truth. The orientation loss l_r is the Multi-Bin loss. The 3D IoU confidence loss l_c is a binary cross-entropy loss supervised by the IoU between the predicted 3D BBox and the ground truth. l_IoU is the IoU loss between the predicted 3D BBox and the ground truth [44].
each vehicle, which is represented as p = {p0 , ..., pk }.
5. 3D Shape Auto-Labeling
The 3D shape annotation is to compute the best PCA
In this section, we will introduce how to automatically fit s and the object 6-DoF pose (R,
coefficient b t). Existing
b b
the 3D shape to the visual observations and then automati- 3D object detection benchmarks (e.g., KITTI, Waymo) only
cally generate ground-truth annotations of 2D keypoints and label the yaw angle because they assume vehicles are on the
Figure 3: The pipeline of the proposed 3D shape-aware auto-labeling framework. From different kinds of vehicle CAD models, a mean shape template and r principal basis components are obtained. Given a 3D object sample in the KITTI dataset, the optimal principal component coefficients s and the 6-DoF pose are iteratively optimized by minimizing the 2D and 3D losses, which are defined on the scanned sparse point cloud and the instance mask.

5.1. Deformable Vehicle Template

In real-world traffic scenarios, there are many different vehicle types (e.g., coupe, hatchback, notchback, SUV, MPV, etc.) and their geometric shapes vary significantly. To perform 3D shape fitting, a straightforward solution is to build a 3D shape dataset and regard the fitting process as model retrieval. However, such dataset construction is labor-intensive, inefficient, and costly. Instead, we use a deformable 3D model for vehicle representation [25]. Specifically, this template is composed of a set of r PCA (Principal Component Analysis) basis components. Any new 3D vehicle M(s) can be represented as a mean shape model M_0 plus a linear combination of the r principal components with coefficients s = [s_1, s_2, ..., s_r] as

    M(s) = M_0 + \sum_{k=1}^{r} s_k \delta_k p_k,    (9)

where p_k and \delta_k are the k-th principal component direction and its corresponding standard deviation, and s_k is the coefficient of the k-th principal component. Based on this vehicle template, we can automatically fit an optimal 3D shape to the visual observations (details in Subsec. 5.2).
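A minimal NumPy sketch of Eq. 9: given the mean shape and the PCA basis, a new vehicle mesh is produced by adding a weighted combination of the principal directions. The array shapes and the toy template values below are assumptions for illustration only (666 vertices matches the downsampled model mentioned in Sec. 6.1, but the basis itself is random here).

```python
import numpy as np

def deform_vehicle(mean_shape, pca_dirs, pca_stds, s):
    """Instantiate a vehicle shape M(s) from the deformable template (Eq. 9).

    mean_shape: (V, 3) vertices of the mean model M_0
    pca_dirs:   (r, V, 3) principal component directions p_k
    pca_stds:   (r,) standard deviations delta_k of each component
    s:          (r,) low-dimensional shape coefficients s_k
    """
    offsets = np.einsum("k,kvc->vc", s * pca_stds, pca_dirs)
    return mean_shape + offsets

# Hypothetical toy template: 666 vertices and r = 4 components
rng = np.random.default_rng(0)
M0 = rng.normal(size=(666, 3))
P = rng.normal(size=(4, 666, 3))
delta = np.array([0.8, 0.5, 0.3, 0.1])
vertices = deform_vehicle(M0, P, delta, s=np.array([1.2, -0.4, 0.0, 0.7]))
print(vertices.shape)  # (666, 3)
```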
5.2. 3D Shape Optimization

For each vehicle, our goal is to assign a proper 3D shape that fits the visual observations, including the 2D instance mask, the 3D bounding box, and the 3D LiDAR points. Specifically, the annotations of the 2D instance mask I_ins and the 3D bounding box are provided by the KINS dataset [32] and the KITTI dataset [6], respectively. The annotation of the 3D LiDAR points for each vehicle is much more complex. According to the labeled 3D bounding boxes, we first segment out the individual 3D points from the entire raw point cloud. Then we remove the ground points using a ground-plane estimation method (i.e., RANSAC-based plane fitting). Finally, we obtain the "clean" 3D points for each vehicle, represented as p = {p_0, ..., p_k}.
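The ground removal step can be as simple as a RANSAC plane fit followed by discarding near-plane points. A generic NumPy sketch is given below; the distance threshold and iteration count are assumptions, and the paper does not specify whether the fit is constrained to near-horizontal planes.

```python
import numpy as np

def remove_ground(points, dist_thresh=0.15, iters=200, seed=0):
    """Fit a plane with RANSAC and drop points close to it.

    points: (N, 3) LiDAR points segmented by a labeled 3D box.
    Returns the points farther than dist_thresh from the fitted plane.
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(points), size=3, replace=False)
        p0, p1, p2 = points[idx]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-8:              # degenerate (collinear) sample, skip it
            continue
        normal /= norm
        dist = np.abs((points - p0) @ normal)
        inliers = dist < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return points[~best_inliers]     # keep the "clean" non-ground points
```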
The 3D shape annotation then amounts to computing the best PCA coefficients \hat{s} and the object 6-DoF pose (\hat{R}, \hat{t}). Existing 3D object detection benchmarks (e.g., KITTI, Waymo) only label the yaw angle because they assume vehicles lie on the road plane. However, we experimentally find (an example is given in Fig. 5) that the other two angles (i.e., pitch and roll) can significantly improve the 3D shape annotation results. Therefore, the loss function is formulated as

    \hat{s}, \hat{R}, \hat{t} = \arg\min_{s, R, t} \{ \alpha L_{2D}(Pr(M(s), R, t), I_{ins}) + \beta L_{3D}(v(s), R, t, p) \},    (10)

which consists of the 3D point loss L_3D and the 2D instance loss L_2D. Here, \alpha and \beta are two hyper-parameters that balance the two constraints.

The operation Pr(·) is a differentiable rendering function that produces the binary mask \tilde{I}_{ins} of M(s) under the transformation {R, t}. Specifically, L_2D is defined as the sum of differences over each pixel (i, j) in image I:

    L_{2D} = \sum_{(i,j) \in I} | \tilde{I}_{ins}(i, j) - I_{ins}(i, j) |.    (11)

For each 3D point p_i in p, we find its nearest neighbor v_i in v(\tilde{s}) and compute the distance between them. L_3D is defined as the sum of distances over all correspondence pairs:

    L_{3D} = \sum_{p_i \in p} \| p_i - v(\tilde{s})_i \|.    (12)

The objective in Eq. 10 is optimized by gradient descent. The vehicle's center position and orientation are used to initialize t and the yaw angle, while the pitch and roll angles and the PCA coefficients are initialized as zeros. We then run the pipeline forward, compute L_2D and L_3D, and back-propagate the gradients to update s, R, and t. In Fig. 4, we depict the intermediate results during the optimization process.
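A simplified PyTorch sketch of this fitting loop is given below. It optimizes only the shape coefficients, the translation, and the yaw angle under the L_3D term of Eq. 10; the full method also optimizes pitch and roll and adds the mask term L_2D through a differentiable renderer [9, 20, 21], which is omitted here. Function and variable names are illustrative; t_init is assumed to be a tensor and ry_init a float.

```python
import torch

def deform(mean_shape, pca_dirs, pca_stds, s):
    # Eq. 9 in torch: M(s) = M_0 + sum_k s_k * delta_k * p_k
    return mean_shape + torch.einsum("k,kvc->vc", s * pca_stds, pca_dirs)

def yaw_matrix(ry):
    c, si = torch.cos(ry), torch.sin(ry)
    z, o = torch.zeros_like(ry), torch.ones_like(ry)
    return torch.stack([torch.stack([c, z, si]),
                        torch.stack([z, o, z]),
                        torch.stack([-si, z, c])])

def l3d(lidar_pts, verts):
    # Eq. 12: each LiDAR point is matched to its nearest model vertex
    return torch.cdist(lidar_pts, verts).min(dim=1).values.sum()

def fit_shape(mean_shape, pca_dirs, pca_stds, lidar_pts, t_init, ry_init,
              steps=200, lr=2e-3):
    s = torch.zeros(pca_dirs.shape[0], requires_grad=True)
    t = t_init.clone().requires_grad_(True)
    ry = torch.tensor(ry_init, requires_grad=True)
    opt = torch.optim.Adam([s, t, ry], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        verts = deform(mean_shape, pca_dirs, pca_stds, s) @ yaw_matrix(ry).T + t
        loss = l3d(lidar_pts, verts)   # beta * L_3D only; the alpha * L_2D mask
                                       # term needs a differentiable renderer [20]
        loss.backward()
        opt.step()
    return s.detach(), ry.detach(), t.detach()
```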
Figure 4: Illustration of the 3D shape optimization process from step 0 to step 200. From top to bottom, the rendered mask gradually covers the target mask, the 3D model vertices gradually align with the point cloud, and the 3D deformed shape changes from the mean shape to a 'notchback'.

Figure 5: An example of CAD model-fitting results with one angle (yaw) only or three angles (yaw, roll, and pitch). When the road is not flat, serious misalignment will happen if only the yaw angle is used for optimization.

6. Experimental Results

We implement the approach and evaluate it on the public KITTI [6] 3D object detection benchmark.
Methods               Modality   AP3D 70 (%)                 APBEV 70 (%)                Time (s)
                                 Moderate   Easy    Hard     Moderate   Easy    Hard
M3D-RPN [1]           Mono        9.71      14.76    7.42    13.67      21.02   10.23   0.16
SMOKE [24]            Mono        9.76      14.03    7.84    14.49      20.83   12.75   0.03
MonoPair [4]          Mono        9.99      13.04    8.65    14.83      19.28   12.89   0.06
RTM3D [19]            Mono       10.34      14.41    8.77    14.20      19.17   11.99   0.05
AM3D [27]             Mono*      10.74      16.50    9.52    17.32      25.03   14.91   0.4
PatchNet [26]         Mono*      11.12      15.68   10.17    16.86      22.97   14.97   0.4
RefinedMPL [37]       Mono*      11.14      18.09    8.94    17.60      28.08   13.95   0.15
KM3D [17]             Mono       11.45      16.73    9.92    16.20      23.44   14.47   0.03
D4LCN [5]             Mono*      11.72      16.65    9.51    16.02      22.51   12.55   0.20
IAFA [47]             Mono       12.01      17.81   10.61    17.88      25.88   15.35   0.03
YOLOMono3D [22]       Mono       12.06      18.28    8.42    17.15      26.79   12.56   0.05
Monodle [28]          Mono       12.26      17.23   10.29    18.89      24.79   16.00   0.04
MonoRUn [3]           Mono       12.30      19.65   10.58    17.34      27.94   15.24   0.07
GrooMeD-NMS [11]      Mono       12.32      18.10    9.65    18.27      26.19   14.05   0.12
DDMP-3D [38]          Mono       12.78      19.71    9.80    17.89      28.08   13.44   0.18
Ground-Aware [23]     Mono       13.25      21.65    9.91    17.98      29.81   13.08   0.05
CaDDN [34]            Mono*      13.41      19.17   11.46    18.91      27.94   17.19   0.63
MonoEF [49]           Mono       13.87      21.29   11.71    19.70      29.03   17.26   0.03
MonoFlex [43]         Mono       13.89      19.94   12.07    19.75      28.23   16.89   0.03
Baseline Method [17]  Mono       11.45      16.73    9.92    16.20      23.44   14.47   0.03
AutoShape-16kps       Mono†      13.72      21.75   10.96    19.00      30.43   15.57   0.04
AutoShape-48kps       Mono†      14.17      22.47   11.36    20.08      30.66   15.59   0.05
Improvements          -          +2.72      +5.74   +1.44    +3.88      +7.22   +1.12   -

Table 1: Comparison with other public methods on the KITTI testing server for 3D "Car" detection. For the "direct" methods, we represent the "Modality" with "Mono" only. We use * to indicate that depth has been used by these methods during the training and inference procedure. † indicates that CAD models have been used in the data labeling stage. For easy understanding, we have highlighted the top numbers in red for each column and the second best is shown in blue.
6.1. Dataset and Implementation Details

Dataset: the KITTI dataset is collected from real traffic environments on European streets. The whole dataset is divided into training and test subsets, which consist of 7,481 and 7,518 frames, respectively. Since the ground truth for the test set is not available, we divide the training data into a train set and a val set as in [50], obtaining 3,712 samples for training and 3,769 samples for validation to refine our model. On the KITTI benchmark, the objects are categorized into "Easy", "Moderate", and "Hard" based on their height in the image, occlusion ratio, etc.

Evaluation Metric: we focus on the evaluation of the "Car" category because it has been considered most often in previous approaches. For evaluation, the average precision (AP) with Intersection over Union (IoU) is used as the metric. Our AutoShape approach is compared with existing methods on the test set using AP|R40 by training our model on the whole 7,481 images. We evaluate on the val set for the ablation studies by training our model on the train set, also using AP|R40.

Implementation Details: we implement our auto-labeling approach (Sec. 5) using a differentiable renderer [9, 20, 21] and optimize it with the Adam optimizer and a learning rate of 0.002. To speed up the optimization and save memory, we downsample the PCA model to 666 vertices and 998 faces. We set α and β to 1.0 and 5.0, respectively. Our shape-aware 3D detection network uses DLA-34 [41] as the backbone. We pad the image size to 1280 × 384. The 3D IoU confidence loss weight w_c and the 3D IoU loss weight w_IoU are increased from 0 to 1 with an exponential ramp-up strategy [13]. We use the Adam optimizer with a base learning rate of 0.0001 for 200 epochs and reduce it by 10× at epochs 100 and 160. We project the ground truth to the corresponding right image and use random scaling (between 0.6 and 1.4), random shifting within the image range, and color jittering for data augmentation. The network is trained on 2 NVIDIA Tesla V100 (16G) GPU cards with the batch size set to 16. For the KITTI test set evaluation, we sample 16/48 keypoints from the 3D shape, plus the 8 corner points and 1 center point, to train the network.

6.2. Data Auto-Labeling Evaluation

Our approach automatically generates the 2D keypoints and their corresponding 3D locations in the local object coordinate frame, which are employed as the supervision signal during the training process. To verify the quality of the labeling results, the 2D instance segmentation mean AP and the 3D bounding box mean AP are used for verification. Specifically, the 2D instance segmentation IoU is calculated between the mask projected from the 3D model and the ground-truth mask (from KINS [32]). We obtain the labeled 3D bounding box from the dimensions of the 3D model with the optimized 6-DoF pose, which is compared to the ground-truth 3D bounding box provided by [6] using the mean IoU score. Tab. 2 shows the detailed comparison results. The proposed method achieves 0.86 for the 2D mean AP and 0.76 for the 3D mean AP, which justifies the effectiveness of our auto-labeling approach.

6.3. Evaluation for 3D Object Detection

The evaluation of the proposed approach against other SOTA methods for 3D detection on the KITTI [6] test set is given in Tab. 1. From the table, we can clearly see that the proposed method with 48 keypoints achieves 4 first places out of 6 tasks with the AP|R40 metric. We also report our method with 16 keypoints, which has a faster inference time while keeping promising accuracy. In addition, most of the existing methods, such as [27, 26, 37, 5, 34], need to estimate a depth map, resulting in a heavy computation burden at inference. In contrast, our method obtains the depth information through the 3D shape-aware geometric constraints, which is more accurate with a faster running speed. We achieve 25 FPS on an NVIDIA V100 GPU card with the 16-keypoint configuration. Compared with the baseline geometric constraint methods [17, 19] that use the 8 corners and 1 center point as keypoints for training, our method with 48 keypoints utilizes more shape-aware keypoints to construct stronger geometric constraints, obtaining +5.74%, +2.72%, +1.44%, +7.22%, +3.88%, and +1.12% improvements for AP3D and APBEV on the "Easy", "Moderate", and "Hard" categories.
6.4. Qualitative Results

Qualitative results of the 3D shape auto-labeling are shown in Fig. 6. Each vehicle in the image is overlaid with a rendered 3D model optimized by our method. We can see the consistency between our labeled shape and the real object. We also visualize some representative results of our shape-aware model in Fig. 7. Our model can predict object locations accurately even for distant and truncated objects.

Figure 6: Qualitative results of our 3D shape auto-labeling.

Figure 7: Qualitative 3D detection results on the KITTI validation set. Red boxes represent our predictions, and green boxes come from the ground truth. LiDAR signals are only used for visualization.
6.5. Ablation Studies

The Number of Keypoints: our shape-aware 3D detection network benefits from the geometric constraints of the 2D-3D keypoints from the 3D shape. To better understand the effect of different numbers of keypoints, we vary it from 0 to 48 with an interval of 8. Note that the 8 corners and the 1 center point are always maintained in this experiment, and we only vary the extra keypoints. As shown in Fig. 8, from 0 to 16 keypoints the network performance is significantly improved. From 16 to 48, however, we observe that the performance is not sensitive to the number of keypoints. The main reason is that denser 2D shape points can overlap in the W/4 × H/4 heatmap during the regression process. Furthermore, with more keypoints the network consumes more GPU memory for storage and computation, resulting in longer training and inference times. In practice, we set the number of extra keypoints to 16, which is a good compromise between accuracy and efficiency.

Figure 8: 3D object detection performance with different numbers of keypoints on the KITTI val set using AP|R40.

2D/3D Loss for Auto-Labeling: our auto-labeling approach (Sec. 5) can generate a precise posed 3D shape for each 2D vehicle instance. The key technique is simultaneously optimizing the 2D/3D constraints (losses) for better matching. Here, we conduct an ablation study to justify the effectiveness of the 2D/3D losses. We first use only the 3D point loss L_3D in the objective function. Then we use only the 2D mask loss L_2D for optimization. Finally, we take both 2D/3D losses into account. Tab. 2 shows that using both losses gives the best performance in the auto-labeling process. We further observe that the impact of the 2D mask loss L_2D is more important than that of the 3D point loss. By using both L_2D and L_3D, the labeling accuracy is improved to 86.35 and 76.92, resulting in better 3D detection performance. This correlation indicates that 3D detection performance can be significantly improved by using high-quality 3D shape labeling data.

L2D   L3D   Label Acc.        Car 3D Det.
            I2D     I3D       Easy    Mod.    Hard
      X     0.61    0.71      15.49   11.35    9.34
X           0.82    0.74      18.36   13.88   11.23
X     X     0.86    0.76      20.09   14.65   12.07

Table 2: Shape auto-labeling ablation experiments on the KITTI val set using AP|R40.

7. Conclusion

In this paper, we present a framework for real-time monocular 3D object detection that explicitly employs shape-aware geometric constraints between 3D keypoints and their 2D projections on images. Both the 3D keypoints and the 2D projected points are learned by deep neural networks. We further design an automatic annotation pipeline for labeling the object's 3D shape, which can automatically generate the shape-aware 2D/3D keypoint correspondences for each object. Experimental results show that our approach achieves state-of-the-art detection accuracy with real-time performance. Our approach is general for other types of vehicles, and in the future we are interested in validating its performance on other object categories.
References

[1] Garrick Brazil and Xiaoming Liu. M3D-RPN: Monocular 3D region proposal network for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 9287-9296, 2019.
[2] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teuliere, and Thierry Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2040-2049, 2017.
[3] Hansheng Chen, Yuyao Huang, Wei Tian, Zhong Gao, and Lu Xiong. MonoRUn: Monocular 3D object detection by reconstruction and uncertainty propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10379-10388, 2021.
[4] Yongjian Chen, Lei Tai, Kai Sun, and Mingyang Li. MonoPair: Monocular 3D object detection using pairwise spatial relationships. arXiv preprint arXiv:2003.00504, 2020.
[5] Mingyu Ding, Yuqi Huo, Hongwei Yi, Zhe Wang, Jianping Shi, Zhiwu Lu, and Ping Luo. Learning depth-guided convolutions for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1000-1001, 2020.
[6] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354-3361. IEEE, 2012.
[7] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3828-3838, 2019.
[8] Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 807-814. IEEE, 2005.
[9] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907-3916, 2018.
[10] Jason Ku, Alex D Pon, and Steven L Waslander. Monocular 3D object detection leveraging accurate proposals and shape reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11867-11876, 2019.
[11] Abhinav Kumar, Garrick Brazil, and Xiaoming Liu. GrooMeD-NMS: Grouped mathematically differentiable NMS for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8973-8983, 2021.
[12] Abhijit Kundu, Yin Li, and James M Rehg. 3D-RCNN: Instance-level 3D object reconstruction via render-and-compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3559-3568, 2018.
[13] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[14] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12697-12705, 2019.
[15] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155, 2009.
[16] Buyu Li, Wanli Ouyang, Lu Sheng, Xingyu Zeng, and Xiaogang Wang. GS3D: An efficient 3D object detection framework for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1019-1028, 2019.
[17] Peixuan Li. Monocular 3D detection with geometric constraints embedding and semi-supervised training, 2020.
[18] Peiliang Li, Xiaozhi Chen, and Shaojie Shen. Stereo R-CNN based 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7644-7652, 2019.
[19] Peixuan Li, Huaici Zhao, Pengfei Liu, and Feidao Cao. RTM3D: Real-time monocular 3D detection from object keypoints for autonomous driving. arXiv preprint arXiv:2001.03343, 2020.
[20] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3D reasoning. The IEEE International Conference on Computer Vision (ICCV), Oct 2019.
[21] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. A general differentiable mesh renderer for image-based 3D reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[22] Yuxuan Liu, Lujia Wang, and Liu Ming. YOLOStereo3D: A step back to 2D for efficient stereo 3D detection. In 2021 International Conference on Robotics and Automation (ICRA). IEEE, 2021.
[23] Yuxuan Liu, Yuan Yixuan, and Ming Liu. Ground-aware monocular 3D object detection for autonomous driving. IEEE Robotics and Automation Letters, 6(2):919-926, 2021.
[24] Zechen Liu, Zizhang Wu, and Roland Tóth. SMOKE: Single-stage monocular 3D object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 996-997, 2020.
[25] Feixiang Lu, Zongdai Liu, Xibin Song, Dingfu Zhou, Wei Li, Hui Miao, Miao Liao, Liangjun Zhang, Bin Zhou, Ruigang Yang, et al. PerMO: Perceiving more at once from a single image for autonomous driving. arXiv e-prints, pages arXiv-2007, 2020.
[26] Xinzhu Ma, Shinan Liu, Zhiyi Xia, Hongwen Zhang, Xingyu Zeng, and Wanli Ouyang. Rethinking pseudo-LiDAR representation. In European Conference on Computer Vision, pages 311-327. Springer, 2020.
[27] Xinzhu Ma, Zhihui Wang, Haojie Li, Pengbo Zhang, Wanli Ouyang, and Xin Fan. Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 6851-6860, 2019.
[28] Xinzhu Ma, Yinmin Zhang, Dan Xu, Dongzhan Zhou, Shuai Yi, Haojie Li, and Wanli Ouyang. Delving into localization errors for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4721-4730, 2021.
[29] Fabian Manhardt, Wadim Kehl, and Adrien Gaidon. ROI-10D: Monocular lifting of 2D detection to 6D pose and metric shape. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2069-2078, 2019.
[30] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3061-3070, 2015.
[31] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Kosecka. 3D bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7074-7082, 2017.
[32] Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Amodal instance segmentation with KINS dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3014-3023, 2019.
[33] Rui Qian, Divyansh Garg, Yan Wang, Yurong You, Serge Belongie, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger, and Wei-Lun Chao. End-to-end pseudo-LiDAR for image-based 3D object detection. arXiv preprint arXiv:2004.03080, 2020.
[34] Cody Reading, Ali Harakeh, Julia Chae, and Steven L Waslander. Categorical depth distribution network for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8555-8564, 2021.
[35] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D object proposal generation and detection from point cloud. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770-779, 2019.
[36] Xibin Song, Peng Wang, Dingfu Zhou, Rui Zhu, Chenye Guan, Yuchao Dai, Hao Su, Hongdong Li, and Ruigang Yang. ApolloCar3D: A large 3D car instance understanding benchmark for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5452-5462, 2019.
[37] Jean Marie Uwabeza Vianney, Shubhra Aich, and Bingbing Liu. RefinedMPL: Refined monocular pseudo-LiDAR for 3D object detection in autonomous driving. arXiv preprint arXiv:1911.09712, 2019.
[38] Li Wang, Liang Du, Xiaoqing Ye, Yanwei Fu, Guodong Guo, Xiangyang Xue, Jianfeng Feng, and Li Zhang. Depth-conditioned dynamic message propagation for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 454-463, 2021.
[39] Xinlong Wang, Wei Yin, Tao Kong, Yuning Jiang, Lei Li, and Chunhua Shen. Task-aware monocular depth estimation for 3D object detection. arXiv preprint arXiv:1909.07701, 2019.
[40] Xinshuo Weng and Kris Kitani. Monocular 3D object detection with pseudo-LiDAR point cloud. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0-0, 2019.
[41] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2403-2412, 2018.
[42] Sergey Zakharov, Wadim Kehl, Arjun Bhargava, and Adrien Gaidon. Autolabeling 3D objects with differentiable rendering of SDF shape priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12224-12233, 2020.
[43] Yunpeng Zhang, Jiwen Lu, and Jie Zhou. Objects are different: Flexible monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3289-3298, 2021.
[44] Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. IoU loss for 2D/3D object detection. In 2019 International Conference on 3D Vision (3DV), pages 85-94. IEEE, 2019.
[45] Dingfu Zhou, Jin Fang, Xibin Song, Liu Liu, Junbo Yin, Yuchao Dai, Hongdong Li, and Ruigang Yang. Joint 3D instance segmentation and object detection for autonomous driving. In the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1839-1849, 2020.
[46] Dingfu Zhou, Vincent Frémont, Benjamin Quost, and Bihao Wang. On modeling ego-motion uncertainty for moving object detection from a mobile platform. In IEEE Intelligent Vehicles Symposium Proceedings, pages 1332-1338, 2014.
[47] Dingfu Zhou, Xibin Song, Yuchao Dai, Junbo Yin, Feixiang Lu, Miao Liao, Jin Fang, and Liangjun Zhang. IAFA: Instance-aware feature aggregation for 3D object detection from a single image. In Proceedings of the Asian Conference on Computer Vision, 2020.
[48] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[49] Yunsong Zhou, Yuan He, Hongzi Zhu, Cheng Wang, Hongyang Li, and Qinhong Jiang. Monocular 3D object detection: An extrinsic parameter free approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7556-7566, 2021.
[50] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490-4499, 2018.
8. Supplemental Material

8.1. Ablation Study for Keypoints Confidence Regression

The 2D/3D keypoint regression is a critical component of the proposed framework; however, inaccurate regression of these keypoints is inevitable in real AD scenarios for many reasons, e.g., viewpoint changes, occlusion, and labeling noise. In particular, such prediction outliers will greatly affect the results of the linear system described in Eq. 6. To handle this problem, we propose to predict a confidence score for each keypoint and employ it as a weight that determines its contribution to the linear system. To verify the effectiveness of the predicted confidence, we run a series of ablation studies on the "Car" category.

We give the results in Tab. 3. From this table, we can see that the 3D object detection performance is significantly improved by integrating the regressed keypoint confidences. More importantly, this improvement is independent of the number of keypoints. In addition, to further understand the actual meaning of the predicted confidence, we visualize it in Fig. 9. Interestingly, we find that the keypoints with high confidence usually come from the ground points (the intersection points between the tires and the ground) and from distinctive shape border points. These points contribute more to the object pose estimation.

Num. Kps.   Kps. Confi.   Car 3D Det.
                          Easy    Mod.    Hard
16                        16.49   12.31   10.54
16          X             19.59   14.50   11.88
48                        16.85   12.39   10.04
48          X             20.09   14.65   12.07

Table 3: Keypoints confidence ablation experiments on the KITTI val set using the AP|R40 metric.

Figure 9: Visualization of keypoint confidences. Here, blue represents a score of "1" and yellow represents a score of "0", and the color changing from blue to yellow represents the confidence score decreasing from "1" to "0". This figure is better viewed in color print.

8.2. Multi-class Detection

Currently, the designed AutoShape model cannot generate keypoint annotations for "Pedestrian" and "Cyclist" due to the lack of CAD models. Here, we simply transfer the 3D keypoints from the mean "Car" template to "Pedestrian" and "Cyclist" by first normalizing them and then re-scaling them to the bounding box size of the other categories. With these keypoints generated, the object's pose can be solved in the same way as for the "Car" category. We evaluate multi-class 3D detection on the KITTI test server, and the performance is shown in Tab. 4. From this table, we can see that the proposed framework performs relatively well even though the keypoint annotations for "Pedestrian" and "Cyclist" are not very accurate. Interestingly, we find that "Cyclist" gives much better results than "Pedestrian"; this is because a "Cyclist" can be considered a rigid object to some extent. On the contrary, a "Pedestrian" is non-rigid, and the locations of its keypoints vary a lot with different object poses.
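This transfer is a simple normalize-then-rescale of the mean-car keypoints. A small sketch is given below under the assumption that the dimensions are given as (l, w, h) aligned with the keypoints' local axes; the numerical values are hypothetical.

```python
import numpy as np

def transfer_keypoints(car_kpts, car_dims, target_dims):
    """Map 3D keypoints from the mean "Car" template to another class.

    car_kpts:    (n, 3) keypoints in the car's local frame
    car_dims:    (3,) car template dimensions along the same axes
    target_dims: (3,) target class bounding-box dimensions (e.g., pedestrian)
    """
    normalized = car_kpts / car_dims      # scale keypoints into a unit box
    return normalized * target_dims       # re-scale to the target box size

# Hypothetical dimensions (meters), for illustration only
car_dims = np.array([4.0, 1.7, 1.6])
ped_dims = np.array([0.8, 0.6, 1.8])
car_kpts = np.random.default_rng(1).uniform(-0.5, 0.5, size=(16, 3)) * car_dims
ped_kpts = transfer_keypoints(car_kpts, car_dims, ped_dims)
```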

Methods         Pedestrian 3D Det.          Cyclist 3D Det.
                Easy    Mod.    Hard        Easy    Mod.    Hard
M3D-RPN [1]     4.92    3.48    2.94        0.94    0.65    0.47
MonoPair [4]    10.02   6.68    5.53        3.79    2.12    1.83
MonoFlex [43]   9.43    6.31    5.26        4.17    2.35    2.04
Ours            5.46    3.74    3.03        5.99    3.06    2.70

Table 4: Quantitative results for "Pedestrian" and "Cyclist" on the KITTI test set with the AP|R40 metric.
