Fig. 3. Block diagram of the visual odometry software system.
Fig. 5. Illustration of transform integration. The curves represent transforms. The green colored segment is refined by the bundle adjustment and becomes the blue colored segment, published at a low frequency. The orange colored segments represent frame to frame motion transforms, generated at a high frequency. The transform integration step takes the orange segment from the green segment and connects it to the blue segment. This results in integrated motion transforms published at the high frequency.

For features without depth, z_i^k is set at a default value. Let J be the set of image frames in the sequence, and let l be the first frame in the set. Upon initialization, all features appearing in the sequence are projected into {C^l}, denoted as X̃_i^l, i ∈ I. Define T_l^j as the transform projecting X̃_i^l from {C^l} to {C^j}, where j is a different frame in the sequence, j ∈ J \ {l}. The bundle adjustment minimizes the following function by adjusting the motion transform between each pair of consecutive frames and the coordinates of X̃_i^l,

    min Σ_{i,j} (T_l^j(X̃_i^l) − X̃_i^j)^T Ω_i^j (T_l^j(X̃_i^l) − X̃_i^j),   i ∈ I, j ∈ J \ {l}.   (11)

Here, X̃_i^j represents the observation of feature i at frame j, and Ω_i^j is its information matrix. The first two entries on the diagonal of Ω_i^j are given constant values. If the depth is from the depth map, the third entry is set to a larger value; if the depth is from triangulation, the value is smaller and inversely proportional to the square of the depth. A zero value is used for features with unknown depth.
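To make the weighting scheme concrete, the following C++ sketch assembles the diagonal of Ω_i^j for a single feature observation. It is only an illustration of the rule stated above: the constants kImageWeight and kDepthMapWeight, the enum, and the function name are placeholders, not values or names from our implementation.

    #include <array>

    enum class DepthSource { None, DepthMap, Triangulation };

    // Diagonal of the 3x3 information matrix for one feature observation.
    // The constants below are illustrative placeholders.
    std::array<double, 3> informationDiagonal(DepthSource source, double depth) {
      const double kImageWeight = 1.0;     // constant weight on the first two entries
      const double kDepthMapWeight = 10.0; // larger weight when depth comes from the depth map
      std::array<double, 3> omega = {kImageWeight, kImageWeight, 0.0};
      if (source == DepthSource::DepthMap) {
        omega[2] = kDepthMapWeight;                    // depth taken from the registered depth map
      } else if (source == DepthSource::Triangulation) {
        omega[2] = kDepthMapWeight / (depth * depth);  // smaller, proportional to 1 / depth^2
      } else {
        omega[2] = 0.0;                                // depth unknown: no depth constraint
      }
      return omega;
    }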
The bundle adjustment publishes refined motion transforms at a low frequency. With the camera frame rate between 10-40Hz, the bundle adjustment runs at 0.25-1.0Hz. As illustrated in Fig. 5, a transform integration step takes the bundle adjustment output and combines it with the high frequency frame to frame motion estimates. The result is integrated motion transforms published at the same high frequency as the frame to frame motion transforms.
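A minimal sketch of this integration step is given below, assuming motion transforms are represented as Eigen::Isometry3d; the function and variable names are illustrative rather than taken from our implementation. The latest refined transform from the bundle adjustment is simply chained with the frame to frame transforms estimated after the last frame it covers.

    #include <Eigen/Geometry>
    #include <vector>

    // Combine the low-frequency refined transform from the bundle adjustment
    // with the high-frequency frame-to-frame estimates accumulated after it.
    // 'refined' maps the world frame to the last frame refined by the bundle
    // adjustment; 'frameToFrame' holds the incremental motions since then.
    Eigen::Isometry3d integrateTransforms(
        const Eigen::Isometry3d& refined,
        const std::vector<Eigen::Isometry3d>& frameToFrame) {
      Eigen::Isometry3d integrated = refined;
      for (const auto& increment : frameToFrame) {
        integrated = integrated * increment;  // chain each high-frequency step
      }
      return integrated;  // published at the high frequency
    }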
VII. EXPERIMENTS

The visual odometry is tested with author-collected data and the KITTI benchmark datasets. It tracks Harris corners [26] by the Kanade Lucas Tomasi (KLT) method [31]. The program is implemented in C++ on the Robot Operating System (ROS) [32] in Linux. The algorithms run on a laptop computer with 2.5GHz cores and 6GB memory, using around three cores for computation. The feature tracking and bundle adjustment (Section VI) take one core each, and the frame to frame motion estimation (Section V) and depth map registration together consume another core. Our software code and datasets are publicly available¹, in two different versions based on the two sensors in Fig. 2.

¹ wiki.ros.org/demo_rgbd and wiki.ros.org/demo_lidar
A. Tests with Author-collected Datasets

We first conduct tests with author-collected datasets using the two sensors in Fig. 2. The data is collected from four types of environments shown in Fig. 6: a conference room, a large lobby, a cluttered road, and a flat lawn. The difficulty increases over the tests as the environments become more open and the depth information changes from dense to sparse. We present two images from each dataset, on the 2nd and 4th rows in Fig. 6. The red colored areas indicate the coverage of the depth maps, from the RGB-D camera (right figure) and the lidar (left figure). Here, note that the depth map registers depth images from the RGB-D camera or point clouds from the lidar at multiple frames, and usually contains more information than that from a single frame. With the RGB-D camera, the average amount of imaged area covered by the depth map reduces from 94% to 23% over the tests. The lidar has a longer detection range and can provide more depth information in open environments; its depth coverage changes from 89% to 47% of the images.
The camera frame rate is set at 30Hz for both sensors. To evenly distribute the features within the images, we separate an image into 3 × 5 identical subregions. Each subregion provides maximally 30 features, giving maximally 450 features in total.
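The bucketing described here can be sketched as follows. The code assumes each feature carries a corner response score used to rank candidates within a subregion; the struct and function names are illustrative, not part of our released code.

    #include <algorithm>
    #include <vector>

    struct Feature { float u, v, response; };  // pixel coordinates and corner score

    // Keep at most 'maxPerCell' features in each cell of a rows x cols grid,
    // preferring the strongest corner responses. The grid size (3 x 5) and the
    // per-cell cap (30) follow the text; everything else is an illustrative sketch.
    std::vector<Feature> bucketFeatures(const std::vector<Feature>& features,
                                        int imageWidth, int imageHeight,
                                        int rows = 3, int cols = 5,
                                        int maxPerCell = 30) {
      std::vector<std::vector<Feature>> cells(rows * cols);
      for (const Feature& f : features) {
        int r = std::min(rows - 1, static_cast<int>(f.v * rows / imageHeight));
        int c = std::min(cols - 1, static_cast<int>(f.u * cols / imageWidth));
        cells[r * cols + c].push_back(f);
      }
      std::vector<Feature> kept;
      for (auto& cell : cells) {
        std::sort(cell.begin(), cell.end(),
                  [](const Feature& a, const Feature& b) { return a.response > b.response; });
        if (static_cast<int>(cell.size()) > maxPerCell) cell.resize(maxPerCell);
        kept.insert(kept.end(), cell.begin(), cell.end());
      }
      return kept;  // at most rows * cols * maxPerCell features (450 here)
    }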
The method is compared to two popular RGB-D visual odometry methods. Fovis estimates the motion of the camera by tracking image features, and depth is associated to the features from the depth images [20]. DVO is a dense tracking method that minimizes the photometric error within the overall images [23]. Both methods use data from the RGB-D camera. Our method is separated into two versions, using the two sensors in Fig. 2, respectively. The resulting trajectories are presented on the 1st and 3rd rows in Fig. 6, and the accuracy is compared in Table I, using errors in 3D coordinates. Here, the camera starts and stops at the same position, and the gap between the two ends of a trajectory compared to the length of the trajectory is considered the relative position error.
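In other words, the metric in Table I is the end-point gap expressed as a percentage of the distance traveled. A minimal C++ sketch of this computation follows; the Position struct and function name are illustrative.

    #include <cmath>
    #include <vector>

    struct Position { double x, y, z; };

    // Relative position error as used in Table I: the gap between the first and
    // last estimated positions (the camera starts and stops at the same place),
    // divided by the total length of the estimated trajectory.
    double relativePositionError(const std::vector<Position>& trajectory) {
      if (trajectory.size() < 2) return 0.0;
      auto dist = [](const Position& a, const Position& b) {
        return std::sqrt((a.x - b.x) * (a.x - b.x) +
                         (a.y - b.y) * (a.y - b.y) +
                         (a.z - b.z) * (a.z - b.z));
      };
      double length = 0.0;
      for (size_t i = 1; i < trajectory.size(); ++i)
        length += dist(trajectory[i - 1], trajectory[i]);
      if (length <= 0.0) return 0.0;
      double gap = dist(trajectory.front(), trajectory.back());
      return 100.0 * gap / length;  // percentage of the distance traveled
    }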
TABLE I
Results using author-collected data. The error is measured at the end of a trajectory as a % of the distance traveled.

                                  Relative position error
Environment   Distance   Fovis     DVO       Our VO (RGB-D)   Our VO (Lidar)
Room          16m        2.72%     1.87%     2.14%            2.06%
Lobby         56m        5.56%     8.36%     1.84%            1.79%
Road          87m        13.04%    13.60%    1.53%            0.79%
Lawn          86m        9.97%     32.07%    3.72%            1.73%
From these results, we conclude that all four methods function similarly when depth information is sufficient (in the room environment), while the relative error of DVO is slightly lower than that of the other methods. However, as the depth information becomes sparser, the performance of Fovis and DVO degrades significantly. During the last two tests, Fovis frequently pauses without giving odometry output due to an insufficient number of inlier features. DVO continuously generates output but drifts heavily. This is because both methods use only the imaged areas where depth is available, leaving large parts of the visual images unused. On the other hand, the two versions of our method are able to maintain accuracy in the tests, except that the relative error of the RGB-D camera version is relatively large in the lawn environment, because the depth is too sparse during the turn at the top of Fig. 6(h).
B. Tests with KITTI Datasets
The proposed method is further tested with the KITTI datasets. The datasets are logged with sensors mounted on the top of a passenger vehicle, in road driving scenarios. The vehicle is equipped with color stereo cameras, monochrome stereo cameras, a 360° Velodyne laser scanner, and a high accuracy GPS/INS for ground truth. Both image and laser data are logged at 10Hz. The image resolution is around 1230 × 370 pixels, with 81° horizontal field of view. Our method uses the imagery from the left monochrome camera and the laser data, and tracks maximally 2400 features from 3 × 10 identical subregions in the images.
TABLE II
Configurations and results of the KITTI datasets. The error is measured using segments of a trajectory at 100 m, 200 m, ..., 800 m lengths, as an averaged % of the segment lengths.
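The segment-based metric described in the caption can be sketched as follows, assuming time-synchronized ground-truth and estimated poses of equal length given as Eigen::Isometry3d in a common world frame. This is an illustrative approximation of the evaluation, not the official KITTI benchmark code, which differs in details (for example, it also reports rotation error).

    #include <Eigen/Geometry>
    #include <vector>

    // For segment lengths of 100 m, 200 m, ..., 800 m along the ground truth,
    // take the end-point drift of the relative motion as a percentage of the
    // segment length, and average over all segments.
    double averageSegmentError(const std::vector<Eigen::Isometry3d>& groundTruth,
                               const std::vector<Eigen::Isometry3d>& estimate) {
      // Cumulative distance traveled along the ground-truth trajectory.
      std::vector<double> cumDist(groundTruth.size(), 0.0);
      for (size_t i = 1; i < groundTruth.size(); ++i)
        cumDist[i] = cumDist[i - 1] +
            (groundTruth[i].translation() - groundTruth[i - 1].translation()).norm();

      double errorSum = 0.0;
      int count = 0;
      for (double segLen = 100.0; segLen <= 800.0; segLen += 100.0) {
        for (size_t start = 0; start + 1 < groundTruth.size(); ++start) {
          // Find the first pose that is segLen meters ahead along the ground truth.
          size_t end = start;
          while (end < groundTruth.size() && cumDist[end] - cumDist[start] < segLen)
            ++end;
          if (end >= groundTruth.size()) break;
          // Relative motion over the segment, from ground truth and from the estimate.
          Eigen::Isometry3d gtRel  = groundTruth[start].inverse() * groundTruth[end];
          Eigen::Isometry3d estRel = estimate[start].inverse() * estimate[end];
          double drift = (gtRel.inverse() * estRel).translation().norm();
          errorSum += 100.0 * drift / segLen;
          ++count;
        }
      }
      return count > 0 ? errorSum / count : 0.0;
    }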
The 11 datasets are manually separated into segments and labeled with an environment type. For each environment, the visual odometry is tested with and without the bundle adjustment. Fig. 8 shows the distributions of the relative errors. Overall, the bundle adjustment helps reduce the mean errors by 0.3%-0.7%, and seems to be more effective in urban and country scenes than on highways, partially because the feature quality is lower in the highway scenes.
VIII. CONCLUSION AND FUTURE WORK

Insufficient depth information is a common scenario for RGB-D cameras and lidars, which have limited ranges. Without sufficient depth, solving the visual odometry problem is hard. Our method handles the problem by exploiting both visual features whose depth is available and features whose depth is unknown. The depth is associated to the features in two ways, from a depth map and by triangulation using the previously estimated motion. Further, a bundle adjustment is implemented which refines the frame to frame motion estimates. The method is tested with author-collected data using two sensors and with the KITTI benchmark datasets. The results are compared to popular visual odometry methods, and the accuracy is comparable to state-of-the-art stereo methods.

Considering future work, the current method uses Harris corners tracked by the KLT method. We experience difficulty in reliably tracking features in some indoor environments, such as a homogeneously colored corridor. Improvement of feature detection and tracking is needed. Further, the method is currently tested with depth information from RGB-D cameras and lidars. In the future, we will try to utilize depth provided by stereo cameras, and possibly extend the scope of our method to stereo visual odometry.
REFERENCES

[1] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. of the International Symposium on Mixed and Augmented Reality, Nara, Japan, Nov. 2007, pp. 1-10.
[2] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," in IEEE International Conference on Computer Vision, 2011, pp. 2320-2327.
[3] J. Engel, J. Sturm, and D. Cremers, "Semi-dense visual odometry for a monocular camera," in IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, Dec. 2013.
[4] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in IEEE International Conference on Robotics and Automation (ICRA), May 2014.
[5] D. Scaramuzza, "1-point-RANSAC structure from motion for vehicle-mounted cameras by exploiting non-holonomic constraints," International Journal of Computer Vision, vol. 95, pp. 74-85, 2011.
[6] S. Weiss, M. Achtelik, S. Lynen, M. Achtelik, L. Kneip, M. Chli, and R. Siegwart, "Monocular vision for long-term micro aerial vehicle state estimation: A compendium," Journal of Field Robotics, vol. 30, no. 5, pp. 803-831, 2013.
[7] P. Corke, D. Strelow, and S. Singh, "Omnidirectional visual odometry for a planetary rover," in Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, Sept. 2004, pp. 149-171.
[8] K. Konolige, M. Agrawal, and J. Solà, "Large-scale visual odometry for rough terrain," Robotics Research, vol. 66, pp. 201-212, 2011.
[9] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354-3361.
[10] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research, vol. 32, pp. 1229-1235, 2013.
[11] D. Nister, O. Naroditsky, and J. Bergen, "Visual odometry for ground vehicle applications," Journal of Field Robotics, vol. 23, no. 1, pp. 3-20, 2006.
[12] M. Maimone, Y. Cheng, and L. Matthies, "Two years of visual odometry on the Mars Exploration Rovers," Journal of Field Robotics, vol. 24, no. 2, pp. 169-186, 2007.
[13] A. Howard, "Real-time stereo visual odometry for autonomous ground vehicles," in IEEE International Conference on Intelligent Robots and Systems, Nice, France, Sept. 2008.
[14] A. Geiger, J. Ziegler, and C. Stiller, "StereoScan: Dense 3D reconstruction in real-time," in IEEE Intelligent Vehicles Symposium, Baden-Baden, Germany, June 2011.
[15] L. Paz, P. Pinies, and J. Tardos, "Large-scale 6-DOF SLAM with stereo-in-hand," IEEE Transactions on Robotics, vol. 24, no. 5, pp. 946-957, 2008.
[16] R. Newcombe, A. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in IEEE International Symposium on Mixed and Augmented Reality, 2011, pp. 127-136.
[17] N. Engelhard, F. Endres, J. Hess, J. Sturm, and W. Burgard, "Real-time 3D visual SLAM with a hand-held RGB-D camera," in RGB-D Workshop on 3D Perception in Robotics at the European Robotics Forum, 2011.
[18] T. Whelan, H. Johannsson, M. Kaess, J. Leonard, and J. McDonald, "Robust real-time visual odometry for dense RGB-D mapping," in IEEE International Conference on Robotics and Automation, 2013.
[19] I. Dryanovski, R. Valenti, and J. Xiao, "Fast visual odometry and mapping from RGB-D data," in IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, 2013.
[20] A. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and N. Roy, "Visual odometry and mapping for autonomous flight using an RGB-D camera," in Int. Symposium on Robotics Research (ISRR), 2011.
[21] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments," The International Journal of Robotics Research, vol. 31, no. 5, pp. 647-663, 2012.
[22] S. Rusinkiewicz and M. Levoy, "Efficient variants of the ICP algorithm," in Third International Conference on 3D Digital Imaging and Modeling (3DIM), Quebec City, Canada, June 2001.
[23] C. Kerl, J. Sturm, and D. Cremers, "Robust odometry estimation for RGB-D cameras," in IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 2013.
[24] J. Sturm, E. Bylow, C. Kerl, F. Kahl, and D. Cremers, "Dense tracking and mapping with a quadrocopter," in Unmanned Aerial Vehicle in Geomatics (UAV-g), Rostock, Germany, 2013.
[25] G. Hu, S. Huang, L. Zhao, A. Alempijevic, and G. Dissanayake, "A robust RGB-D SLAM algorithm," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, Portugal, Oct. 2012.
[26] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. New York: Cambridge University Press, 2004.
[27] R. Murray and S. Sastry, A Mathematical Introduction to Robotic Manipulation. CRC Press, 1994.
[28] R. Andersen, "Modern methods for robust regression," Sage University Paper Series on Quantitative Applications in the Social Sciences, 2008.
[29] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications, 3rd Edition. Springer, 2008.
[30] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. Leonard, and F. Dellaert, "iSAM2: Incremental smoothing and mapping using the Bayes tree," International Journal of Robotics Research, vol. 31, pp. 217-236, 2012.
[31] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of Imaging Understanding Workshop, 1981, pp. 121-130.
[32] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R. Wheeler, and A. Ng, "ROS: An open-source Robot Operating System," in Workshop on Open Source Software (Collocated with ICRA 2009), Kobe, Japan, May 2009.
[33] H. Badino, A. Yamamoto, and T. Kanade, "Visual odometry by multi-frame feature integration," in Workshop on Computer Vision for Autonomous Driving (Collocated with ICCV 2013), Sydney, Australia, 2013.
[34] H. Badino and T. Kanade, "A head-wearable short-baseline stereo system for the simultaneous estimation of structure and motion," in IAPR Conference on Machine Vision Application, Nara, Japan, 2011.