1. Introduction
Object detection in two-dimensional (2D) images is a classic image processing problem. With the recent advances in deep learning, many deep learning-based 2D object detection models have been proposed. There are generally two types of deep learning-based object detection—one-stage and two-stage [
1]. A two-stage detector can achieve high accuracy at the expense of a high computational load, whereas a one-stage detector offers faster speed at slightly lower accuracy. For some applications, it is better to detect objects in three-dimensional (3D) space. Due to the absence of depth cues in a 2D image, monocular 3D object detection remains a challenging research problem.
3D object detection methods estimate the 3D-oriented bounding box for each object. The current research on point cloud-based 3D detection algorithms, with input from light detection and ranging (LiDAR) or depth cameras (RGB-D), has made great progress. Alternatively, monocular 3D object detection systems achieve the aforementioned goal with the input of a single 2D image. LiDAR-based methods are fast and accurate; however, the systems are expensive and complicated to set up. Monocular 3D object detection methods, depending on the number of 3D points estimated, are slower, but as the main piece of equipment is a video camera, the systems are low-cost and easy to set up. Similarly to the categorization of 2D object detection models, Kim and Hwang [
2] also grouped monocular 3D object detection systems into two main categories—the multi-stage approach and the end-to-end approach. Image-based object detection has gained applications in many fields, such as autonomous driving and robot vision. Intelligent transportation systems demand the localization of pedestrians and vehicles. Chen et al. [
3] presented a detailed analysis of many deep learning-based frameworks for pedestrian and vehicle detection. Arnold et al. [
4] presented the first survey on 3D object detection for autonomous driving.
Acquiring 3D object bounding boxes from a monocular image is an ill-posed problem because 2D images lack the depth information of the objects, which is essential for object localization in 3D space. To solve this missing depth problem, a natural solution [
5] is to use convolutional neural networks (CNNs) to regress the depth information. The estimated depth map, with one more dimension, provides additional information for 3D detection. This is referred to as the pseudo-LiDAR approach. Although depth estimation is helpful for 3D target detection, the capabilities of monocular depth estimation algorithms are quite limited.
Moreover, a recent study [
6] concluded that inaccurate depth estimation is not the only reason for the low accuracy of monocular 3D target detection; it is also because the depth map representation itself is not well suited to 3D problems. Following this argument, other methods [
7,
8] were proposed to convert the depth map into a pseudo-point cloud representation. With this 3D data input, a point cloud-based approach can be adopted for 3D detection. Significant performance improvements on the KITTI dataset [
9] can be achieved. However, there is still a big gap between the performances of pseudo-LiDAR-based methods and those based on real 3D point clouds. We conducted thorough experiments and observed that this gap arises because even high-ranking depth estimation algorithms produce depth maps in which the depths of the foreground objects are still extremely inaccurate. To address this issue, we therefore focused on improving the depth accuracy of foreground objects in the depth map.
Figure 1a shows the deviation between the pseudo-LiDAR point cloud of a car estimated by the monocular depth estimation method DORN [
5] and the ground truth location (the green box). With our proposed depth refinement, as shown in
Figure 1b, the pseudo-LiDAR point cloud of the car can be dragged back to the rightful location.
Figure 1 demonstrates the result of our depth distribution adjustment. Our proposed method is able to fix the long-tail problem, as shown in
Figure 1a,b.
Prior knowledge of an object’s physical size, the scene information and the imaging process of the camera are crucial for depth recovery. Such geometric priors have been exploited to predict objects’ poses or distances. Mono RCNN [
10] and GUPNet [
11] use the geometric constraint between the physical height of the target and its projected visual height to predict the position of the target more accurately. However, the accuracy of these algorithms is severely affected by the inferred physical height of the target and by the pixel height of the target in the image. They suffer from a large number of missed detections of distant targets (more than 50 m).
Figure 2 shows a comparison of depth estimation errors (deviations of depth from the center of the object) among the state-of-the-art monocular depth estimation method [
5], the geometric-constraint-based method [
11] and ground truth on the KITTI validation (val) split. DORN [
5] can estimate the depths of objects nearby and far away. However, the error increases progressively with the distance of the target. GUPNet [
11] can estimate the depths of near targets with reasonably little error. However, the method fails when objects are far away, specifically further than 50 m. As shown in
Figure 2, there are no detected depth data at distances beyond 50 m. Other algorithms [
7,
8] were proposed to solve this problem by forcing deep neural networks to learn the offsets of pseudo-LiDAR point clouds with inaccurate locations. We argue that the displacement of pseudo-LiDAR is not the natural form of objects, and thus learning this noisy representation of pseudo-LiDAR will lead to constant failures in real-world problems; instead, a learned refinement of the pseudo-LiDAR itself is beneficial for the monocular 3D detection task. In our work, inspired by geometric constraints, we propose a novel depth refinement framework to predict object location, in which geometry information serves as a priori information. Another problem is that the inaccuracy of depth estimation at the edges of objects leads to the long-tail problem in the depth distribution of foreground targets. Large amounts of noise appear when we transform the estimated depth map into a pseudo-point cloud. Therefore, we limit the long-tail problem around objects by correctly repositioning the pseudo-LiDAR points along the contours of objects in the depth map.
To solve the aforementioned problems, we propose a two-stage method to alleviate the 3D location error and the long-tail noise of objects. The advantages of our proposed method are two-fold. Specifically, by adopting the geometry information in the scene context as a constraint, we jointly fuse object segmentation, depth estimation and the physical shapes of objects as a priori information to refine the depths of foreground objects in our depth maps. Additionally, using the prior knowledge of objects’ physical shapes, we adjust the distribution of the pseudo-LiDAR point cloud accordingly with a simple yet effective class-specific normalization strategy. Finally, a point cloud-based 3D object detection method is introduced into our pipeline to obtain the 3D detection results.
A thorough evaluation of our proposed method was conducted. Both qualitative and quantitative results show that, by introducing geometric constraints as a priori information and fusing this prior information with object segmentation and depth estimation, the performance in predicting the depths of foreground objects can be significantly improved. The method achieves 15% higher accuracy on the KITTI validation dataset than other methods. The class-specific normalization strategy produces better pseudo-LiDAR point cloud representations and leads to a 10% performance improvement on the KITTI validation dataset, which includes small objects such as pedestrians.
In summary, we propose a novel monocular 3D object detection framework that can improve the robustness of the pseudo-LiDAR-based method with the input of a single image. Our main contributions are as follows:
We found that the main limitations of monocular 3D detection are due to the inaccuracy of the target position and the uncertainty of the depth distribution of the foreground target. These two problems arise from inaccurate depth estimation.
We first propose an innovative method based on joint image segmentation and geometric-constraint-based target-guided depth adjustment to predict the target depth and provide the depth prediction confidence measure. The accuracy of the predicted target depth in the depth map is improved.
We utilize the prior target size and a normalization strategy to tackle the long-tail noise problem in pseudo-LiDAR point clouds. The uncertainty of the depth distribution is reduced.
Thorough experimentation has been carried out with the KITTI dataset. With the two novel solutions, our proposed monocular 3D object detection framework outperforms various state-of-the-art methods.
The paper is structured as follows. The related studies are reviewed in
Section 2. We focus on various monocular 3D object detection models. In addition, publicly available datasets that were generated to facilitate research are introduced.
Section 3 describes our proposed deep learning framework.
Section 4 presents the experimental results and comparative analysis, in which we evaluate our framework and compare its performance with those of state-of-the-art methods. Finally, in
Section 5, we draw the conclusions and outline future research directions.
3. Proposed Framework
An overview of our proposed framework is shown in
Figure 3. A single RGB image is fed into two separate deep learning models: the monocular depth estimation network and the 2D instance segmentation network. The coarse depth map is improved by the depth distribution adjustment (DDA) module, while the depth map and mask proposals are used by the geometric constraint depth refinement (GCDR) module to refine the depth information of the segmented instances. Inspired by GUPNet [
11], we specifically replace the 2D bounding box priors with the 2D segmentation. The long-tail problem in the depth data is solved by reducing the influence of large depth error on the boundary of the object. With the depth data enhanced by these two modules and the camera matrix, a better 3D representation (pseudo-LiDAR point cloud) is generated. Finally, we use an off-the-shelf LiDAR-based 3D detection algorithm to infer the 3D detection results (location, scale and bounding box). The following sub-sections describe 2D instance segmentation, monocular depth estimation, depth refinement, pseudo-LiDAR point cloud generation and 3D object detection, in detail.
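For clarity, the data flow described above can be summarized as a minimal sketch in which each component is passed in as a callable; the function and parameter names below are illustrative placeholders rather than the actual implementation.

```python
def monocular_3d_pipeline(image, camera_matrix,
                          depth_net, seg_net, gcdr, dda, to_pseudo_lidar, detector_3d):
    """Sketch of the data flow in Figure 3; each stage is supplied as a callable."""
    depth_map = depth_net(image)                 # coarse monocular depth map (e.g., DORN)
    masks, heights_3d = seg_net(image)           # 2D instance masks + per-object 3D height

    # Refine per-object depth with geometric constraints (GCDR, Section 3.3)
    depth_map = gcdr(depth_map, masks, heights_3d, camera_matrix)
    # Narrow each object's depth distribution to suppress long-tail noise (DDA, Section 3.4)
    depth_map = dda(depth_map, masks)

    # Back-project the refined depth map to a pseudo-LiDAR point cloud (Section 3.5)
    points = to_pseudo_lidar(depth_map, camera_matrix)
    # Off-the-shelf LiDAR-based detector, e.g., F-PointNet or PV-RCNN (Section 3.6)
    return detector_3d(points)
```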
3.1. 2D Instance Segmentation
Pseudo-LiDAR-based monocular 3D detection tends to suffer from the problem of low accuracy. This is because, unlike real LiDAR data, pseudo-LiDAR data are generated from the depth map, which is estimated from a single RGB image. The estimated depth map has pixel-level density, but the depth information it carries is insufficiently accurate. As the 2D object proposal serves as a prior, its estimation needs to be robust in order to improve the accuracy of 3D detection. Within the 2D bounding box in a depth map, there are redundant points between the object boundary and the box boundary. LiDAR-based 3D detection algorithms usually perform point cloud segmentation, and the redundant points will certainly confuse this segmentation and affect the 3D detection. To remove the interference of redundant points, we use 2D instance segmentation instead of a 2D bounding box. We compare the point clouds converted from the refined depth map using 2D segmentation and a 2D bounding box in
Figure 4. As displayed in the figure, when all points within the 2D bounding box are refined using our proposed method, the redundant points lying between the bounding box and the segmentation boundaries are also, mistakenly, dragged back into the object’s 3D location. With 2D segmentation, the redundant points not enclosed by the ground truth location are largely removed. The use of 2D instance segmentation therefore results in a better pseudo-LiDAR point cloud for the subsequent 3D object detection. Specifically, we use the anchor-free CenterMask [
34] as our instance segmentation network. It produces more accurate and robust 2D instance segmentation. Moreover, it is more capable of tackling partially occluded objects.
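As a concrete illustration of why the mask is preferable to the box, the following sketch compares the depth pixels that would be lifted to pseudo-LiDAR under the two alternatives; variable names are illustrative only.

```python
import numpy as np

def object_depth_pixels(depth_map, instance_mask, bbox):
    """Depth pixels selected by a 2D box versus by a 2D instance mask.

    depth_map:     (H, W) float array of estimated depths.
    instance_mask: (H, W) boolean array, True on the object.
    bbox:          (x1, y1, x2, y2) pixel coordinates of the 2D box.
    """
    x1, y1, x2, y2 = bbox
    box_mask = np.zeros_like(instance_mask, dtype=bool)
    box_mask[y1:y2, x1:x2] = True

    depths_from_box = depth_map[box_mask]        # includes redundant background pixels
    depths_from_mask = depth_map[instance_mask]  # only pixels that belong to the object
    return depths_from_box, depths_from_mask
```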
3.2. Monocular Depth Estimation
To convert the input RGB image into 3D space, a monocular depth estimation algorithm can provide the distance information for 3D object detection. We use the state-of-the-art monocular depth estimation algorithm DORN [
5] as a sub-network in our framework. This pre-trained network is utilized as an offline module and integrated with the other parts of our framework. Since our proposed framework is agnostic to the choice of depth estimation algorithm, one can replace DORN with any other algorithm, if necessary, without modifying the rest of the framework.
3.3. Object-Guided Depth Refinement
Figure 5 shows the architecture of the proposed object-guided depth refinement module. Aiming to guide the model to refine the depth estimation only for foreground objects, we combine the pre-computed depth map and the 2D instance segmentation to generate region-of-interest (RoI) features for each predicted 2D object. We follow the geometric projection law of the pinhole camera, in which the physical height of the object is needed. We modify and re-train CenterMask [
34] by adding a sub-head that regresses the 3D physical height of each object together with its uncertainty, modeled by a Laplacian distribution. To cope with the difficulty that both depth estimation and monocular 3D detection face for faraway objects, we propose a penalty factor generated from the discretization of the depth distribution in order to optimize the learning process of our model and to better predict the depth information of distant objects.
Directly regressing depth from a 2D detector is a hard and ill-posed problem. Inspired by GUPNet [
11], we propose a geometric constraint depth refinement (GCDR) module to not only refine the depth of near objects, but also optimize the depth estimation of distant objects. Instead of avoiding the acquisition of depth from the 2D image, as in [
11], we simply learn the depth uncertainty in a probabilistic framework through geometric projection, based on priors from the depth map and the 2D instance segmentation. In this way, we can enhance the reliability of the depth refinement.
As a prerequisite, a prediction of 3D height for each object is generated by adding a sub-head to the CenterMask [
34] base model. We assume that the 3D height has uncertainty in a Laplacian distribution, as we simply do not fully trust the reliability of this information. The probability density function of a Laplace random variable $X$ is $f(x \mid \mu, b) = \frac{1}{2b}\exp\!\left(-\frac{|x-\mu|}{b}\right)$, where $\mu$ and $b$ are the location and the diversity parameter. Thus, the mean and standard deviation of such a distribution are $\mu$ and $\sqrt{2}\,b$, respectively. We re-train the CenterMask [34] model to predict the mean $\mu_{H}$ and the standard deviation $\sigma_{H}$, which represent the directly regressed 3D height and its uncertainty, respectively.
To better predict distant objects, which is the bottleneck of current monocular depth estimation and 3D detection methods, we introduce a distance-sensitive factor (DSF) as a penalty to force the network to focus on faraway objects. We adopt a depth discretization strategy named linear interpolate discretization (LID) [10], as shown in
Figure 6. The depth values are quantized such that the further the object, the larger the range of the depth bin it may fall in:

$d = d_{min} + \frac{d_{max} - d_{min}}{D(D+1)} \, i(i+1),$

where $d$ is the continuous depth value, $[d_{min}, d_{max}]$ is the full depth range of an object, $D$ is the number of depth bins to be separated and $i$ is the depth bin index. Using this discretization strategy, the network can be more tolerant of the depth error and the DSF can be smoother. The DSF is computed from the mean depth value of the depth bin in which the object is located; the exact depth of the object is calculated by adding half of the empirical object length to the mean depth at its 2D mask center. The loss function used to train the model to generate the 3D physical height of each object is

$L_{H} = \mathrm{DSF} \cdot \left( \frac{\sqrt{2}}{\sigma_{H}} \left| H_{gt} - \mu_{H} \right| + \log \sigma_{H} \right),$

where $H_{gt}$ is the ground truth physical height of the object, $\mu_{H}$ is the raw output of the 3D regression head, and $\sigma_{H}$ is the uncertainty of the result. With the introduction of the DSF, we may successfully alleviate the low confidence caused by the long distances of faraway objects, leaving a large $\sigma_{H}$ to indicate only noisy labels or hard objects.
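For illustration, the following sketch maps continuous depths to LID bins and back; the closed-form bin edges, depth range and number of bins are assumptions of the example rather than the exact settings of our implementation.

```python
import numpy as np

def lid_bin_index(depth, d_min=0.0, d_max=80.0, num_bins=80):
    """Bin index of a continuous depth under linearly increasing (LID) bin widths,
    assuming bin edges e_i = d_min + (d_max - d_min) * i * (i + 1) / (D * (D + 1))."""
    c = (np.clip(depth, d_min, d_max) - d_min) * num_bins * (num_bins + 1) / (d_max - d_min)
    idx = np.floor((-1.0 + np.sqrt(1.0 + 4.0 * c)) / 2.0)
    return np.clip(idx.astype(int), 0, num_bins - 1)

def lid_bin_mean_depth(idx, d_min=0.0, d_max=80.0, num_bins=80):
    """Mean depth of LID bin idx (average of its lower and upper edges)."""
    denom = num_bins * (num_bins + 1)
    lower = d_min + (d_max - d_min) * idx * (idx + 1) / denom
    upper = d_min + (d_max - d_min) * (idx + 1) * (idx + 2) / denom
    return 0.5 * (lower + upper)
```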
With the predicted distribution of the 3D height $H$, the depth distribution can be approximated using the geometry projection function

$d_{p} = \frac{f H}{h} = \frac{f (\mu_{H} + \sigma_{H} X)}{h} = \frac{f \mu_{H}}{h} + \frac{f \sigma_{H}}{h} X,$

where $h$ is the projected height of the object in the image, $f$ is the focal length of the camera, $\mu_{H}$ and $\sigma_{H}$ are the parameters of the predicted Laplace distribution, $H$ is the 3D physical height and $X$ is the standard Laplace distribution $\mathrm{La}(0, 1)$. Hence, the projected depth follows a Laplace distribution with mean $\mu_{p} = f\mu_{H}/h$ and deviation $\sigma_{p} = f\sigma_{H}/h$. An additional bias generated from the pre-computed depth map is used to initialize the projection results. Its mean value $\mu_{b}$ is computed by combining the mean depth at the 2D mask center $d_{c}$ and half of the empirical object length $L$ as

$\mu_{b} = d_{c} + \frac{L}{2},$

and the corresponding deviation $\sigma_{b}$ is obtained by scaling the learned standard deviation. Then, the final Laplace distribution of the predicted depth $d$ is computed as

$d = \mu_{d} + \sigma_{d} X,$

where $\mu_{d}$ combines the projected mean $\mu_{p}$ and the bias mean $\mu_{b}$. The final uncertainty of the depth distribution contains both the projection uncertainty and the learned bias uncertainty, $\sigma_{d} = \sqrt{\sigma_{p}^{2} + \sigma_{b}^{2}}$, and can be optimized in our GCDR module using the depth refinement loss

$L_{d} = \mathrm{DSF} \cdot \left( \frac{\sqrt{2}}{\sigma_{d}} \left| d_{gt} - \mu_{d} \right| + \log \sigma_{d} \right),$

where $d_{gt}$ is the ground truth depth value. Note that all these terms are also weighted by the DSF, which allows the proposed method to be agnostic with respect to the distance of the object.
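As a minimal numerical sketch of the projection step (assuming the pinhole relation and the quadrature combination of uncertainties given above; the function and variable names are illustrative):

```python
import math

def project_height_to_depth(mu_h, sigma_h, h_pixels, focal):
    """Project a Laplace-distributed 3D height H ~ La(mu_h, sigma_h) to a depth
    distribution via the pinhole relation d = f * H / h (h: projected pixel height)."""
    mu_p = focal * mu_h / h_pixels
    sigma_p = focal * sigma_h / h_pixels
    return mu_p, sigma_p

def combined_depth_uncertainty(sigma_projection, sigma_bias):
    """Total uncertainty combining the projection term and the learned bias term,
    assuming independent components summed in quadrature."""
    return math.sqrt(sigma_projection ** 2 + sigma_bias ** 2)
```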
GCDR aims to optimize the depth accuracy in the depth map, as we want to exploit the potential of a single 2D image to predict depth information in a geometric manner. For the convenience of our further adjustments in the depth map, a confidence score is necessary to indicate whether the result is trustworthy. Since our final distance-insensitive standard deviation is able to indicate the uncertainty of depth without considering the distance of the object, we further project its value into the $[0, 1]$ space via an exponential function to generate the depth confidence score

$c = \exp(-w \, \sigma_{d}),$

where $w$ represents the weights. We treat the object depth in the depth map and the geometric-constraint-refined depth $d$ with different levels of emphasis: the final refined depth of the object is a confidence-weighted combination of the two.
Finally, we adjust the depth information within the 2D segmentation of each object using the final refined depth, which completes the refinement. Our proposed method adopts geometric projection as a compensation measure to avoid direct depth regression from the 2D image. It alleviates the influence of large distances, exploits the potential of a single 2D image, and enhances the depth estimation of both near and faraway objects.
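One plausible realization of this weighting is a convex combination driven by the confidence score; the blending rule and the weight below are illustrative assumptions rather than the exact formulation.

```python
import math

def fuse_refined_depth(d_geometric, d_depth_map, sigma_d, weight=1.0):
    """Confidence-weighted blend of the geometry-refined depth and the depth-map depth.

    The exponential mapping exp(-weight * sigma_d) projects the uncertainty into (0, 1];
    a small uncertainty yields high confidence and thus more emphasis on the geometric
    estimate, while a large uncertainty falls back towards the original depth-map value.
    """
    confidence = math.exp(-weight * sigma_d)
    return confidence * d_geometric + (1.0 - confidence) * d_depth_map
```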
3.4. Depth Distribution Adjustment
GCDR mainly addresses the problem of objects’ misplacement in the depth dimension, by refining and moving objects back into their correct locations. However, using a depth map to generate the pseudo-LiDAR point cloud still has the problem of distortion: the object is stretched out far beyond its actual physical proportions, which is called the long-tail problem. It is mainly caused by blurred, inaccurate depth estimates around the object boundary. The object boundaries in pseudo-LiDAR are crucial, as they represent the shape of the object.
In order to minimize the influence of wrongly estimated depth around the boundary of an object, a depth distribution adjustment (DDA) module is proposed to move points with weak depth estimates into their correct locations by adjusting the distribution of depth information within each object’s mask. We trust the 2D segmentation as a priori knowledge that can be used to determine precisely which pixels belong to the object. By observing the discretized depth of each pixel within a 2D mask, we assume that the depth distribution of each object follows a Poisson distribution; solving the long-tail problem then amounts to narrowing the distribution of pixels in the mask. First, we set the peak of the Poisson distribution as our anchor for further adjustment. Then, using the target scale, i.e., the empirical length of the object, as the standard deviation, we map the Poisson distribution to a discrete Gaussian distribution whose mean $\mu$ is the peak of the Poisson distribution and whose standard deviation $\sigma$ is the empirical object length.
In this way, the distortion of depth around the boundary of each object can be well removed, providing better shape representation after converting the information to a pseudo-LiDAR point cloud.
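The sketch below is one plausible realization of this adjustment, assuming a quantile-matching remapping onto a Gaussian anchored at the distribution peak, with the empirical object length as the spread; the names and histogram granularity are illustrative.

```python
import numpy as np
from scipy.stats import norm

def adjust_object_depth(depth_map, instance_mask, empirical_length, bins=64):
    """Remap the depths inside an object mask onto a Gaussian whose mean is the peak
    of the observed depth distribution and whose spread is set by the class-prior
    (empirical) object length, suppressing long-tail outliers at the boundary."""
    depths = depth_map[instance_mask]

    # Peak (mode) of the observed depth distribution, estimated from a coarse histogram.
    hist, edges = np.histogram(depths, bins=bins)
    peak = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    sigma = empirical_length  # class-prior object length used as the target spread

    # Quantile matching: each pixel keeps its rank but is moved onto the target Gaussian.
    ranks = np.argsort(np.argsort(depths))
    quantiles = (ranks + 0.5) / depths.size
    adjusted = norm.ppf(quantiles, loc=peak, scale=sigma)

    out = depth_map.copy()
    out[instance_mask] = adjusted
    return out
```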
3.5. Pseudo-LiDAR Generation
Pseudo-LiDAR-based 3D detection methods usually take advantage of well-developed LiDAR-based 3D detection methods. Thus, integrating information at hand and mimicking the LiDAR signal is essential. With the help of the camera calibration files provided by public datasets, such as KITTI [
24] and nuScenes [
26], we can easily transform depth maps into point clouds. Given pixel coordinates $(u, v)$ in 2D image space and the estimated depth $d$, the 3D coordinates $(x, y, z)$ in the camera coordinate system can be computed as

$z = d, \qquad x = \frac{(u - c_{x}) \, d}{f_{x}}, \qquad y = \frac{(v - c_{y}) \, d}{f_{y}},$

where $f_{x}$ and $f_{y}$ are the focal lengths of the camera along the $x$ and $y$ axes, respectively, and $(c_{x}, c_{y})$ is the principal point. With the additional extrinsic matrix $[R \mid t]$, we can also acquire the 3D location of each point in the world coordinate system by

$d \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = C \begin{bmatrix} x_{w} \\ y_{w} \\ z_{w} \\ 1 \end{bmatrix},$

where $C = K[R \mid t]$, with $K$ the intrinsic matrix, is the camera parameter calculated from the intrinsic and extrinsic parameters. The parameters required for this calculation are provided by the dataset accordingly, as will be shown in
Section 4.
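A minimal back-projection sketch following the pinhole relations above (camera coordinates only; applying the extrinsic matrix to reach world coordinates would be a further step); fx, fy, cx and cy come from the dataset calibration files.

```python
import numpy as np

def depth_map_to_pseudo_lidar(depth_map, fx, fy, cx, cy):
    """Back-project a depth map into a pseudo-LiDAR point cloud in camera coordinates,
    using z = d, x = (u - cx) * d / fx, y = (v - cy) * d / fy for each pixel (u, v)."""
    h, w = depth_map.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_map
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only pixels with a valid (positive) depth
```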
3.6. 3D Object Detection
The pseudo-LiDAR point cloud generated from the refined depth map is then fed into a LiDAR-based 3D detection algorithm for 3D bounding box estimation. Note that our framework does not require a specific 3D detection algorithm. In this study, we experimented with two different methods, F-PointNet [
18] and PVRCNN [
35]. For F-PointNet [
18], we replaced the approach of adopting a 2D bounding box to generate a frustum proposal with a 2D instance segmentation frustum. As for PVRCNN [
35], no adjustments were implemented.
5. Conclusions
Detection of 3D objects from a single 2D image is a challenging task. Generally, the predicted target location is not very accurate, and the depth data of the target are noisy. These two problems arise from the inaccurate depth estimation. In this paper, we proposed a monocular 3D object detection framework with two innovative solutions. First, we proposed the GCDR module to predict the target depth and provide the depth prediction confidence measure. With the integration of 2D instance segmentation and the coarse depth map, the predicted target location is improved. Second, we proposed the DDA module. By the use of the target scale information, DDA can adjust the depth data distribution and reduce the long-tail noise in the point cloud. With the refined depth information, the 3D representation of the scene, called the pseudo-LiDAR point cloud, is generated. Finally, we use a LiDAR-based algorithm to detect the 3D target. We conducted extensive experiments to investigate the significance of the two proposed modules. We evaluated the performance of the proposed framework on the KITTI dataset. Our proposed framework outperformed various state-of-the-art methods by more than 12.37% and 5.34% on the easy and hard settings of the KITTI validation subset, respectively. For the KITTI test set, our framework outperformed other methods by more than 5.1% and 1.76% on the easy and hard settings, respectively.
In our future work, we would like to conduct more experiments on other datasets, such as nuScenes [
26]. Although our proposed framework outperforms many state-of-the-art monocular 3D object detection methods, there is still room for improvement. Since our method relies on the quality of 2D masks, and generating masks for partially occluded objects remains an unsolved problem, training an occlusion-aware segmentation method may benefit our pipeline. Another direction for improving monocular 3D detection may be to improve the inference strategy of depth prediction. As humans, we can infer the distance of one object if there is a known object in view. If we can train a model to infer object-related distances using the depth information of only a few learned objects, then its monocular depth estimation and 3D detection may be improved.