AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection
Zongdai Liu, Dingfu Zhou*, Feixiang Lu, Jin Fang and Liangjun Zhang
Robotics and Autonomous Driving Laboratory, Baidu Research,
National Engineering Laboratory of Deep Learning Technology and Application, China
{liuzongdai, zhoudingfu, lufeixiang, fangjin, liangjunzhang}@baidu.com
4.3. Loss Function

The overall loss contains the following items: a center point classification loss $l_m$, a center point offset regression loss $l_{off}$, a 2D keypoints regression loss $l_{2D}$, a 3D keypoints regression loss $l_{3D}$, an orientation multi-bin loss $l_r$, a dimension regression loss $l_D$, a 3D IoU confidence loss $l_c$, and a 3D bounding box IoU loss $l_{IoU}$. Specifically, the multi-task loss is defined as

$$L = w_m l_m + w_{off} l_{off} + w_{2D} l_{2D} + w_{3D} l_{3D} + w_r l_r + w_D l_D + w_c l_c + w_{IoU} l_{IoU} \quad (8)$$
where $l_m$ is the focal loss as used in [48], $l_{2D}$ is a depth-guided $L_1$ loss as used in [17], and $l_D$ and $l_{3D}$ are $L_1$ losses with respect to the ground truth. The orientation loss $l_r$ is the multi-bin loss. The 3D IoU confidence loss $l_c$ is a binary cross-entropy loss supervised by the IoU between the predicted 3D bounding box and the ground truth. $l_{IoU}$ is the IoU loss between the predicted 3D bounding box and the ground truth [44].
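As a minimal sketch of how Eq. (8) can be assembled (the weight values below are placeholders; the paper does not list them in this section, and the individual loss terms are assumed to be precomputed scalars):

```python
import torch

# Placeholder weights w_* -- their values are not specified in this section.
WEIGHTS = {"m": 1.0, "off": 1.0, "2D": 1.0, "3D": 1.0,
           "r": 1.0, "D": 1.0, "c": 1.0, "IoU": 1.0}

def multi_task_loss(losses: dict) -> torch.Tensor:
    """Weighted sum of the eight terms of Eq. (8).

    `losses` maps the keys above to the scalar loss tensors
    l_m, l_off, l_2D, l_3D, l_r, l_D, l_c, l_IoU.
    """
    return sum(WEIGHTS[k] * losses[k] for k in WEIGHTS)
```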
5. 3D Shape Auto-Labeling

In this section, we will introduce how to automatically fit the 3D shape to the visual observations and then automatically generate ground-truth annotations of 2D keypoints and 3D keypoints.

where $p_k$ and $\delta_k$ are the principal component direction and the corresponding standard deviation, and $s_k$ is the coefficient of the $k$-th principal component. Based on the vehicle template, we can automatically fit an optimal 3D shape to the visual observations (details in Subsec. 5.2).

5.2. 3D Shape Optimization

For each vehicle, our goal is to assign a proper 3D shape to fit the visual observations, including the 2D instance mask, the 3D bounding box, and the 3D LiDAR points. Specifically, the annotations of the 2D instance mask $I_{ins}$ and the 3D bounding box $B_{box}$ are provided by the KINS dataset [32] and the KITTI dataset [6], respectively. The annotation of 3D LiDAR points for each vehicle is much more complex. According to the labeled 3D bounding boxes, we first segment out the individual 3D points from the entire raw point cloud. Then we remove the ground points using a ground-plane estimation method (i.e., RANSAC-based plane fitting). Finally, we obtain the "clean" 3D points for each vehicle, which are represented as $p = \{p_0, ..., p_k\}$.
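This cleaning step can be sketched as follows, assuming the points are already cropped to a vehicle's labeled 3D box; the distance threshold and iteration count are illustrative, not the paper's settings:

```python
import numpy as np

def remove_ground(points, iters=100, thresh=0.05, rng=None):
    """RANSAC plane fit: return the points that are NOT on the best plane.

    points: (N, 3) array cropped to one vehicle's labeled 3D box.
    A real implementation would also check that the fitted plane's normal
    is near-vertical before treating it as the ground.
    """
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        # Fit a candidate plane through 3 random points.
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:                    # degenerate (collinear) sample
            continue
        n /= norm
        dist = np.abs((points - p0) @ n)   # point-to-plane distance
        inliers = dist < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return points[~best_inliers]           # keep the off-plane "clean" points
```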
The 3D shape annotation is to compute the best PCA coefficients $\hat{s}$ and the object 6-DoF pose $(\hat{R}, \hat{t})$. Existing 3D object detection benchmarks (e.g., KITTI, Waymo) only label the yaw angle because they assume vehicles are on the flat ground.
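A simplified sketch of this fitting loop, minimizing only the 3D point term (nearest-vertex distance to the segmented LiDAR points) while optimizing all three angles as Figure 5 motivates; the full pipeline also renders the model to compute a 2D mask term, which is omitted here, and all function names, step counts, and learning rates are illustrative:

```python
import torch

def euler_to_matrix(a: torch.Tensor) -> torch.Tensor:
    """Rotation from (yaw, pitch, roll); the axis convention is illustrative."""
    cy, sy = torch.cos(a[0]), torch.sin(a[0])
    cp, sp = torch.cos(a[1]), torch.sin(a[1])
    cr, sr = torch.cos(a[2]), torch.sin(a[2])
    zero, one = torch.zeros(()), torch.ones(())
    Rz = torch.stack([torch.stack([cy, -sy, zero]),
                      torch.stack([sy, cy, zero]),
                      torch.stack([zero, zero, one])])
    Ry = torch.stack([torch.stack([cp, zero, sp]),
                      torch.stack([zero, one, zero]),
                      torch.stack([-sp, zero, cp])])
    Rx = torch.stack([torch.stack([one, zero, zero]),
                      torch.stack([zero, cr, -sr]),
                      torch.stack([zero, sr, cr])])
    return Rz @ Ry @ Rx

def fit_shape_and_pose(mean_shape, bases, sigmas, lidar_pts,
                       steps=200, lr=0.01):
    """Optimize PCA coefficients s and a 6-DoF pose (R, t) so that the
    deformed model matches a vehicle's segmented LiDAR points.

    mean_shape: (V, 3)  bases: (r, V, 3)  sigmas: (r,)  lidar_pts: (N, 3)
    """
    s = torch.zeros(bases.shape[0], requires_grad=True)  # shape coefficients
    angles = torch.zeros(3, requires_grad=True)          # yaw, pitch, roll
    t = torch.zeros(3, requires_grad=True)               # translation
    opt = torch.optim.Adam([s, angles, t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Deform the template: v = mean + sum_k s_k * delta_k * p_k.
        verts = mean_shape + torch.einsum("k,k,kvc->vc", s, sigmas, bases)
        verts = verts @ euler_to_matrix(angles).T + t    # apply the pose
        # 3D term: distance from each LiDAR point to its nearest vertex.
        loss = torch.cdist(lidar_pts, verts).min(dim=1).values.mean()
        loss.backward()
        opt.step()
    return s.detach(), angles.detach(), t.detach()
```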
Figure 3: The pipeline of the proposed 3D shape-aware auto-labeling framework. By providing different kinds of vehicle CAD models, a mean shape template and r principal bases can be obtained. Given a 3D object sample in the KITTI dataset, the optimal principal-component coefficients s and the 6-DoF pose can be iteratively optimized by minimizing the 2D and 3D losses, which are defined on the scanned sparse point cloud and the instance mask.
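As a sketch of how such a template could be derived, assuming the CAD models have been pre-registered to a common mesh topology with vertex-to-vertex correspondence (the function name and the choice of r are illustrative):

```python
import torch

def build_shape_basis(cad_vertices: torch.Tensor, r: int = 5):
    """PCA over M registered CAD models with V corresponding vertices each.

    cad_vertices: (M, V, 3). Returns the mean shape (V, 3), the first r
    principal directions p_k as (r, V, 3), and their standard deviations
    delta_k as (r,). The value of r is a placeholder.
    """
    M, V, _ = cad_vertices.shape
    flat = cad_vertices.reshape(M, V * 3)
    mean = flat.mean(dim=0)
    # SVD of the centered data matrix yields principal directions/variances.
    U, S, Vh = torch.linalg.svd(flat - mean, full_matrices=False)
    sigmas = S[:r] / (M - 1) ** 0.5        # delta_k (std along each basis)
    bases = Vh[:r].reshape(r, V, 3)        # p_k
    return mean.reshape(V, 3), bases, sigmas
```

A deformed instance is then $v = \bar{v} + \sum_{k=1}^{r} s_k \delta_k p_k$, one standard form consistent with the definitions of $p_k$, $\delta_k$, and $s_k$ given above.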
Figure 4: Illustration of the 3D shape optimization process from step 0 to step 200. From top to bottom, the rendered mask gradually covers the target mask, the 3D model vertices gradually align with the point cloud, and the 3D deformed shape changes from a mean shape to a 'notchback'.

Figure 5: An example of CAD model-fitting results with one angle (yaw) only or three angles (yaw, roll, and pitch). When the road is not flat, serious misalignment will happen if only the yaw angle is used for optimization.
3D Det.

Methods          Pedestrian              Cyclist
                 Easy    Mod.    Hard    Easy    Mod.    Hard
M3D-RPN [1]      4.92    3.48    2.94    0.94    0.65    0.47
MonoPair [4]     10.02   6.68    5.53    3.79    2.12    1.83
MonoFlex [43]    9.43    6.31    5.26    4.17    2.35    2.04
Ours             5.46    3.74    3.03    5.99    3.06    2.70