
2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2014)
September 14-18, 2014, Chicago, IL, USA

Real-time Depth Enhanced Monocular Odometry


Ji Zhang, Michael Kaess, and Sanjiv Singh

J. Zhang, M. Kaess, and S. Singh are with the Robotics Institute at Carnegie Mellon University, Pittsburgh, PA 15213. Emails: [email protected], [email protected], and [email protected]. The paper is based upon work supported by the National Science Foundation under Grant No. IIS-1328930.

Abstract— Visual odometry can be augmented by depth information such as provided by RGB-D cameras, or from lidars associated with cameras. However, such depth information can be limited by the sensors, leaving large areas in the visual images where depth is unavailable. Here, we propose a method to utilize the depth, even if sparsely available, in recovery of camera motion. In addition, the method utilizes depth by triangulation from the previously estimated motion, and salient visual features for which depth is unavailable. The core of our method is a bundle adjustment that refines the motion estimates in parallel by processing a sequence of images, in a batch optimization. We have evaluated our method in three sensor setups, one using an RGB-D camera, and two using combinations of a camera and a 3D lidar. Our method is rated #2 on the KITTI odometry benchmark irrespective of sensing modality, and is rated #1 among visual odometry methods.

Fig. 1. (a) Features tracked at an image frame. The green dots represent features whose depth comes from the depth map, the blue dots represent features whose depth is determined by triangulation using the previously estimated motion of the camera, and the red dots represent features without depth. The proposed method uses all three types of features in determining motion. (b) A depth image from an RGB-D camera corresponding to (a), where depth information is only available in the vicinity of the camera. The gray and white pixels represent close and far objects with respect to the camera, and the black pixels are areas where depth is unavailable.

I. INTRODUCTION
Visual odometry is the process of egomotion estimation given a sequence of camera imagery. Typically, monocular imagery is insufficient to compute the egomotion because motion along the camera optical axis can cause little motion of visual features and therefore the estimation problem can be degenerate. With a single camera [1]–[4], if assuming unconstrained motion, rotation can be recovered but translation is recovered only up to scale. This situation can be mitigated by using extra information such as knowledge of a non-holonomic motion constraint [5], or measurements from an IMU integrated with the visual odometry [6]. However, the results are dependent on the quality of the extra information involved.

It is possible to obtain scale by using multiple cameras simultaneously [7], [8]. However, this comes at its own cost: a reduced effective field of view and a limitation on the range that can be accurately recovered from the multiple cameras. If a small baseline is used, depth is uncertain for features far away from the camera. But if the cameras are separated significantly, inter-camera calibration becomes difficult and accuracy can be hard to ensure. When used in scenes where a large difference exists between near and far objects, depth can only be obtained in certain parts of the images.

This paper proposes a method that can effectively utilize depth information along with the imagery, even if the depth is only sparsely provided. The method maintains and registers a depth map using the estimated motion of the camera. Visual features are associated with depth from the depth map, or by triangulation using the previously estimated motion. Salient visual features for which depth is unavailable are also used, which themselves provide constraints in solving the problem. Further, the method contains a bundle adjustment which refines the motion estimates in parallel by processing a sequence of images, in a batch optimization.

The proposed method is not limited to RGB-D cameras. It can be adapted to various types of cameras as long as depth information can be acquired and associated. We have collected experimental data using an RGB-D camera and a custom-built sensor consisting of a camera and a 3D lidar (a 2-axis laser scanner). We have also evaluated the method using the well-known KITTI benchmark datasets [9], [10], which contain carefully registered data from a number of sensors. The method reported here uses images from a single camera in a stereo pair and laser scans from a high rate lidar. The results are ranked by the benchmark server.

The rest of this paper is organized as follows. In Section II, we discuss related work. In Section III, we state the problem. The sensor hardware and software system are described in Section IV. The frame to frame motion estimation and bundle adjustment are discussed in Sections V and VI. Experimental results are given in Section VII and conclusions in Section VIII.

II. RELATED WORK

Vision based methods are now common for motion estimation [11], [12]. To solve for 6DOF camera motion, stereo vision systems are often used [13], [14]. The methods track and match visual features between image frames. Depth information of the features is retrieved by triangular geometry established from the two image views with the camera baseline as a reference. In this area, Paz et al. [15] estimate the motion of stereo hand-held cameras. The depth is recovered for features close to the cameras, which helps solve the scale of the translation estimation.

Konolige et al.'s stereo visual odometry recovers the motion from bundle adjustment [8]. The method is integrated with an IMU and is capable of long distance off-road navigation.

The introduction of RGB-D cameras has drawn great attention in the research of visual odometry [16]–[19]. Such cameras provide RGB images along with depth information within the camera range limit. Huang et al. [20] use tracked visual features with known depth from an RGB-D camera to compute the motion estimates. The method eliminates features if the corresponding depth is unavailable from the depth images. Henry et al. [21] integrate visual features with the Iterative Closest Point (ICP) method [22]. The motion estimation employs a joint optimization by minimizing combined 2D and 3D feature matching errors. Another popular approach is dense tracking [23], [24], which minimizes the photometric error using a dense 3D model of the environment built from the depth images. Overall, the methods [20], [21], [23], [24] rely on sufficient depth information for image processing. This can limit their applications, especially if the methods are used in open environments, where depth information is only sparsely available.

Few RGB-D visual odometry methods are able to handle insufficiently provided depth information. In the work of Hu et al., a heuristic switch is used to choose between an RGB-D and a monocular visual odometry method [25]. In contrast to these methods [20], [21], [23]–[25], we propose a single method that handles sparse depth information by combining features with and without depth. The method is compared to [20], [23] experimentally, and robust estimation results are shown. Further, since the method maintains and registers a depth map, it can use depth information from different types of sensors. The method is currently tested with depth from RGB-D cameras and lidars, but is expected to also work with depth from stereo cameras.

III. NOTATIONS AND TASK DESCRIPTION

The visual odometry problem addressed in this paper is to estimate the motion of a camera using monocular images with the assistance of depth information. We assume that the camera is well modeled as a pinhole camera [26], the camera intrinsic parameters are known from pre-calibration, and the lens distortion is removed. As a convention in this paper, we use a right superscript k, k ∈ Z+, to indicate image frames. Define the camera coordinate system, {C}, as follows:

• {C} is a 3D coordinate system with its origin at the camera optical center. The x-axis points to the left, the y-axis points upward, and the z-axis points forward, coinciding with the camera principal axis.

We want to utilize features with and without depth. Let I be a set of feature points. For a feature i, i ∈ I, that is associated with depth, its coordinates in {C^k} are denoted as X_i^k, where X_i^k = [x_i^k, y_i^k, z_i^k]^T. For a feature with unknown depth, we normalize the coordinates such that its z-coordinate is one. Let X̄_i^k be the normalized term of the feature, X̄_i^k = [x̄_i^k, ȳ_i^k, 1]^T. Intuitively, we can imagine a plane that is parallel to the x-y plane of {C^k} at a unit length in front of the coordinate origin, and X̄_i^k is the point projected onto that plane. With the notations defined, our visual odometry problem can be described as follows.

Problem: Given a sequence of image frames k, k ∈ Z+, and features, I, compute the motion of the camera between each two consecutive frames, k and k − 1, using X_i^k if the depth is available, and X̄_i^k if the depth is unknown, i ∈ I.

IV. SYSTEM OVERVIEW

A. Sensor Hardware

The proposed method is validated on three different sensor systems. The first two sensors, shown in Fig. 2, are used to acquire author-collected data, while the third one uses the configuration of the KITTI benchmark datasets. Throughout the paper, we will use data from the first two sensors to illustrate the method. Fig. 2(a) is an Xtion Pro Live RGB-D camera. The camera is capable of providing RGB and depth images at 30Hz, with 640×480 resolution and 58° horizontal field of view. Fig. 2(b) shows a custom-built camera and 3D lidar. The camera can provide RGB images at up to 60Hz, with 744×480 resolution and 83° horizontal field of view. The 3D lidar is based on a Hokuyo UTM-30LX laser scanner, which has a 180° field of view with 0.25° resolution and a 40 lines/sec scan rate. The laser scanner is actuated by a motor for rotational motion to realize 3D scans.

Fig. 2. Two sensors involved in the evaluation. (a) An Xtion Pro Live RGB-D camera. The camera is capable of providing 30Hz RGB and depth images, with 640×480 resolution and 58° HFV. (b) A custom-built camera and 3D lidar. The camera provides up to 60Hz RGB images with 744×480 resolution and 83° HFV. The 3D lidar consists of a Hokuyo UTM-30LX laser scanner rotated by a motor to realize 3D scans. The laser scanner has a 180° FOV with 0.25° resolution and a 40 lines/sec scan rate.

B. Software System Overview

Fig. 3 shows a diagram of the software system. First, visual features are tracked by the feature tracking block. Depth images from RGB-D cameras or point clouds from lidars are registered by the depth map registration block, using the estimated motion. The block also associates depth with the visual features. The frame to frame motion estimation block takes the features as its input, and its output is refined by the bundle adjustment block using sequences of images. The bundle adjustment runs at a low frequency (around 0.25-1.0Hz). The transform integration block combines the high frequency frame to frame motion with the low frequency refined motion, and generates integrated motion transforms at the same frequency as the frame to frame motion transforms. Sections V and VI present each block in detail.

Fig. 3. Block diagram of the visual odometry software system.
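To make the rate matching performed by the transform integration block concrete, the following is a minimal C++ sketch of one way it could be implemented. It is not the authors' implementation; the class interface, the use of Eigen 4×4 homogeneous transforms, and the assumption that frames arrive in order are all editorial assumptions.

```cpp
// Sketch of the transform integration block (Section IV-B, illustrated later in Fig. 5):
// frame to frame transforms arrive at camera rate, refined poses arrive at 0.25-1.0 Hz,
// and the integrated output is published at the camera rate.
#include <Eigen/Geometry>
#include <map>

class TransformIntegrator {
public:
  // Called at camera frame rate with T_{k-1 -> k} from the frame to frame estimation.
  // Frames are assumed to arrive in order, starting at k = 1.
  void addFrameToFrame(int k, const Eigen::Isometry3d& T_prev_to_curr) {
    odomPose_[k] = odomPose_[k - 1] * T_prev_to_curr;   // high-frequency pose chain
    latestFrame_ = k;
  }

  // Called at 0.25-1.0 Hz with the pose of frame j refined by the bundle adjustment.
  void addRefined(int j, const Eigen::Isometry3d& refinedPose) {
    refinedFrame_ = j;
    refinedPose_  = refinedPose;
  }

  // Integrated output at the high frequency: the frame to frame motion accumulated
  // since the last refined frame is appended to the low-frequency refined pose.
  Eigen::Isometry3d integratedPose() const {
    Eigen::Isometry3d sinceRefined =
        odomPose_.at(refinedFrame_).inverse() * odomPose_.at(latestFrame_);
    return refinedPose_ * sinceRefined;
  }

private:
  std::map<int, Eigen::Isometry3d> odomPose_{{0, Eigen::Isometry3d::Identity()}};
  Eigen::Isometry3d refinedPose_ = Eigen::Isometry3d::Identity();
  int refinedFrame_ = 0, latestFrame_ = 0;
};
```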

V. FRAME TO FRAME MOTION ESTIMATION

A. Mathematical Derivation

We start with the mathematical derivation for the frame to frame motion estimation. The corresponding algorithm is discussed in the next section. Recall that X_i^{k-1} and X_i^k are the coordinates of a tracked feature in {C^{k-1}} and {C^k}. Define R and T as the 3 × 3 rotation matrix and 3 × 1 translation vector of the camera between the two frames; we model the camera motion as a rigid body transformation,

X_i^k = R X_i^{k-1} + T.   (1)

Next, we will work on features with known depth. Acquiring depth for the features will be discussed in a later part of this paper. Here, note that we only use depth information from one of the two frames. We choose frame k − 1 for simplicity of implementation. This is because the depth maps are registered in the camera coordinates by the previously estimated motion. By the time of frame k, the depth map at frame k − 1 is available, and the depth of X_i^{k-1} is associated. Recalling that X̄_i^k is the normalized term of X_i^k, where the z-coordinate is one, we rewrite (1) as

z_i^k X̄_i^k = R X_i^{k-1} + T.   (2)

Eq. (2) contains three rows. Combining the 1st and 2nd rows with the 3rd row, respectively, we can eliminate z_i^k. This gives us two equations as follows,

(R_1 − x̄_i^k R_3) X_i^{k-1} + T_1 − x̄_i^k T_3 = 0,   (3)

(R_2 − ȳ_i^k R_3) X_i^{k-1} + T_2 − ȳ_i^k T_3 = 0,   (4)

where R_h and T_h, h ∈ {1, 2, 3}, are the h-th rows of R and T, respectively.

For a feature with unknown depth, we rewrite (1) as follows, where X̄_i^{k-1} is the normalized term of X_i^{k-1},

z_i^k X̄_i^k = z_i^{k-1} R X̄_i^{k-1} + T.   (5)

Eq. (5) also contains three rows. Combining all rows to eliminate both z_i^k and z_i^{k-1}, we obtain

[−ȳ_i^k T_3 + T_2,  x̄_i^k T_3 − T_1,  −x̄_i^k T_2 + ȳ_i^k T_1] R X̄_i^{k-1} = 0.   (6)

So far, we have modeled the frame to frame motion for features with and without depth separately. Now, we will solve the motion using both types of features. Define θ as a 3 × 1 vector, θ = [θ_x, θ_y, θ_z]^T, where θ_x, θ_y, and θ_z are the rotation angles of the camera around the x-, y-, and z-axes between frames k and k − 1. The rotation matrix R can be expressed by the Rodrigues formula [27],

R = e^{θ̂} = I + (θ̂ / ‖θ‖) sin ‖θ‖ + (θ̂^2 / ‖θ‖^2)(1 − cos ‖θ‖),   (7)

where θ̂ is the skew symmetric matrix of θ.

Substituting (7) into (3)-(4), we can derive two equations for a feature with depth, and substituting (7) into (6), we can derive one equation for a feature with unknown depth. Each equation is a function of θ and T. Suppose we have a total of m and n features with known and unknown depth, respectively. Stacking the equations, we obtain a nonlinear function,

f([T; θ]) = ε,   (8)

where f has 2m + n rows, ε is a (2m + n) × 1 vector containing the residuals, and [T; θ] is the vertical joining of the two vectors. Compute the Jacobian matrix of f with respect to [T; θ], denoted as J, where J = ∂f/∂[T; θ]. (8) can be solved by the Levenberg-Marquardt (LM) method [26],

[T; θ] ← [T; θ] − (J^T J + λ diag(J^T J))^{-1} J^T ε.   (9)

Here, λ is a scale factor determined by the LM method.

B. Motion Estimation Algorithm

The frame to frame motion estimation algorithm is presented in Algorithm 1. As discussed, the proposed method only uses depth information associated at frame k − 1, so all features taken as the input at frame k are without depth, as X̄_i^k. Features at frame k − 1 are separated into X_i^{k-1} and X̄_i^{k-1} for those with and without depth. On line 8, a feature with depth contributes two equations to the nonlinear function (8), and on line 11, a feature with unknown depth contributes one equation.

Algorithm 1: Frame to Frame Motion Estimation
 1  input : X̄_i^k, X_i^{k-1} or X̄_i^{k-1}, i ∈ I
 2  output : θ, T
 3  begin
 4      θ, T ← 0;
 5      for a number of iterations do
 6          for each i ∈ I do
 7              if i is depth associated then
 8                  Derive (3)-(4) using X̄_i^k and X_i^{k-1}, substitute (7) into (3)-(4) to obtain two equations, and stack the equations into (8);
 9              end
10              else
11                  Derive (6) using X̄_i^k and X̄_i^{k-1}, substitute (7) into (6) to obtain one equation, and stack the equation into (8);
12              end
13              Compute a bisquare weight for each feature based on the residuals in (3)-(4) or (6);
14              Update θ, T for one iteration based on (9);
15          end
16          if the nonlinear optimization converges then
17              Break;
18          end
19      end
20      Return θ, T;
21  end

The terms θ and T are initialized to zero on line 4. The algorithm is adapted to a robust fitting framework [28]. On line 13, the algorithm assigns a bisquare weight to each feature, based on its residuals in (3)-(4) or (6), respectively. Features that have larger residuals are assigned smaller weights, and features with residuals larger than a threshold are considered outliers and assigned zero weights. On line 14, θ and T are updated for one iteration. The nonlinear optimization terminates if convergence is found, or if the maximum iteration number is met. Finally, the algorithm returns the motion estimate θ and T.
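The following C++ sketch illustrates how the residuals (3), (4), and (6), the Rodrigues formula (7), the bisquare weighting, and the damped update (9) fit together. It is an editorial re-implementation for illustration only, not the authors' code: the use of Eigen, the numerical Jacobian, the bisquare tuning constant, and the struct and function names are assumptions.

```cpp
// Illustrative sketch of the frame to frame motion estimation of Algorithm 1.
#include <Eigen/Dense>
#include <cmath>
#include <vector>

struct Feature {
  Eigen::Vector3d barXk;   // normalized coordinates at frame k: [x̄, ȳ, 1]
  Eigen::Vector3d Xkm1;    // coordinates at frame k-1 (depth known), or [x̄, ȳ, 1] if not
  bool hasDepth;
};

// Rodrigues formula (7): rotation matrix from the rotation angle vector theta.
Eigen::Matrix3d rodrigues(const Eigen::Vector3d& theta) {
  double a = theta.norm();
  if (a < 1e-10) return Eigen::Matrix3d::Identity();
  Eigen::Matrix3d hat;
  hat <<        0, -theta(2),  theta(1),
         theta(2),         0, -theta(0),
        -theta(1),  theta(0),         0;
  return Eigen::Matrix3d::Identity() + hat * (std::sin(a) / a)
         + hat * hat * ((1.0 - std::cos(a)) / (a * a));
}

// Residuals of one feature: two rows from (3)-(4) if depth is known, one row from (6) otherwise.
Eigen::VectorXd residual(const Feature& f, const Eigen::Vector3d& T, const Eigen::Vector3d& theta) {
  Eigen::Matrix3d R = rodrigues(theta);
  if (f.hasDepth) {
    Eigen::Vector3d p = R * f.Xkm1 + T;                       // eq. (1)
    return Eigen::Vector2d(p(0) - f.barXk(0) * p(2),          // eq. (3)
                           p(1) - f.barXk(1) * p(2));         // eq. (4)
  }
  Eigen::Vector3d c(-f.barXk(1) * T(2) + T(1),
                     f.barXk(0) * T(2) - T(0),
                    -f.barXk(0) * T(1) + f.barXk(1) * T(0));
  Eigen::VectorXd r(1);
  r(0) = c.dot(R * f.Xkm1);                                   // eq. (6)
  return r;
}

// One robust, Levenberg-Marquardt style solve for [T; theta], following (9).
void estimateMotion(const std::vector<Feature>& feats, Eigen::Vector3d& T, Eigen::Vector3d& theta) {
  T.setZero(); theta.setZero();                               // line 4 of Algorithm 1
  const double lambda = 1e-3, eps = 1e-6, kBisquare = 0.05;   // assumed tuning constants
  for (int iter = 0; iter < 20; ++iter) {
    int rows = 0;
    for (const Feature& f : feats) rows += f.hasDepth ? 2 : 1;
    Eigen::VectorXd r(rows);
    Eigen::MatrixXd J(rows, 6);
    Eigen::Matrix<double, 6, 1> x; x << T, theta;
    int row = 0;
    for (const Feature& f : feats) {
      Eigen::VectorXd ri = residual(f, T, theta);
      // Bisquare weight: near 1 for small residuals, 0 for outliers beyond the threshold.
      double u = ri.norm() / kBisquare;
      double w = (u < 1.0) ? std::pow(1.0 - u * u, 2) : 0.0;
      // Numerical Jacobian of this feature's residual with respect to [T; theta].
      for (int c = 0; c < 6; ++c) {
        Eigen::Matrix<double, 6, 1> xp = x; xp(c) += eps;
        Eigen::VectorXd rp = residual(f, xp.head<3>(), xp.tail<3>());
        J.block(row, c, ri.size(), 1) = (rp - ri) / eps;
      }
      r.segment(row, ri.size()) = std::sqrt(w) * ri;
      J.middleRows(row, ri.size()) *= std::sqrt(w);
      row += ri.size();
    }
    Eigen::MatrixXd H = J.transpose() * J;
    for (int d = 0; d < 6; ++d) H(d, d) *= (1.0 + lambda);    // J^T J + lambda diag(J^T J)
    Eigen::Matrix<double, 6, 1> dx = H.ldlt().solve(J.transpose() * r);
    x -= dx;                                                  // update (9)
    T = x.head<3>(); theta = x.tail<3>();
    if (dx.norm() < 1e-8) break;                              // convergence check
  }
}
```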
C. Feature Depth Association

In this section, we discuss how to associate depth to the visual features. A depth map is registered by the estimated motion of the camera. The depth map is projected to the last image frame whose transform to the previous frame has been established. Here, we use frame k − 1 to keep the same convention as the previous sections.

New points are added to the depth map upon receiving depth images or point clouds. Only points in front of the camera are kept, and points that were received a certain time ago are forgotten. Then, the depth map is downsized to maintain a constant point density. We want to keep an even angular interval among the points viewed from the origin of {C^{k-1}}, or the optical center of the camera at frame k − 1. Here, we choose angular interval over Euclidean distance interval with the consideration that an image pixel represents an angular interval projected into the 3D environment. We use the same format for the depth map.

The map points are converted into a spherical coordinate system coinciding with {C^{k-1}}. A point is represented by its radial distance, azimuthal angle, and polar angle. When downsizing, only the two angular coordinates are considered, and the points are evenly distributed with respect to the angular coordinates. This results in a point distribution that is denser closer to the camera, and vice versa. An example of a registered depth map is shown in Fig. 4(a), color coded by elevation. The point clouds are collected by the lidar in Fig. 2(b), while the camera points to a wall.

To associate depth to the visual features, we store the depth map in a 2D KD-tree [29] based on the two angular coordinates. For each feature i, i ∈ I, we find the three points from the KD-tree that are the closest to the feature. The three points form a local planar patch in the 3D environment, and the 3D coordinates of the feature are found by projecting onto the planar patch. Denote X̂_j^{k-1}, j ∈ {1, 2, 3}, as the Euclidean coordinates of the three points in {C^{k-1}}, and recall that X_i^{k-1} is the coordinates of feature i in {C^{k-1}}. The depth is computed by solving the function

(X_i^{k-1} − X̂_1^{k-1}) · ((X̂_1^{k-1} − X̂_2^{k-1}) × (X̂_1^{k-1} − X̂_3^{k-1})) = 0.   (10)

Further, if the depth is unavailable from the depth map for some features but they are tracked for longer than a certain distance in the Euclidean space, we triangulate the features using the first and the last frames in the image sequences where the features are tracked. Fig. 4(b) gives an example of depth associated features, corresponding to Fig. 1. The white colored dots are points on the depth map, only available within a limited range. The green colored dots represent features whose depth is provided by the depth map, and the blue colored dots are from triangulation.

Fig. 4. (a) An example of the depth map with point clouds perceived by the lidar in Fig. 2(b). The points are color coded by elevation. The camera points to a wall during the data collection. (b) Features projected into the 3D environment corresponding to Fig. 1. The depth map (white colored dots) is from depth images collected by the RGB-D camera in Fig. 2(a), only available within a limited range. The green colored dots are features whose depth is provided by the depth map, and the blue colored dots are from triangulation using the previously estimated motion.

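A minimal C++ sketch of the depth association step described in Section V-C follows: the map points closest to a feature in the two angular coordinates are found, and the feature's projection ray is intersected with the local planar patch, which is exactly the condition (10). The paper stores the map in a 2D KD-tree [29]; to stay short, this sketch uses a plain linear search, and the struct fields and spherical-angle convention are assumptions.

```cpp
// Sketch of feature depth association (Section V-C, eq. (10)).
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <vector>

struct MapPoint {
  Eigen::Vector3d p;       // Euclidean coordinates in {C^{k-1}}
  double azimuth, polar;   // the two angular coordinates, precomputed when the point
                           // is inserted, e.g. atan2(x, z) and atan2(y, hypot(x, z))
};

// barX = [x̄, ȳ, 1]^T is the normalized feature; on success X = z * barX is returned.
bool associateDepth(const std::vector<MapPoint>& map,
                    const Eigen::Vector3d& barX, Eigen::Vector3d& X) {
  if (map.size() < 3) return false;
  double az = std::atan2(barX.x(), barX.z());
  double po = std::atan2(barX.y(), std::hypot(barX.x(), barX.z()));

  // Find the three map points closest to the feature direction in angular coordinates
  // (the paper uses a 2D KD-tree for this query).
  std::vector<std::pair<double, int>> d;
  d.reserve(map.size());
  for (int i = 0; i < (int)map.size(); ++i) {
    double da = map[i].azimuth - az, dp = map[i].polar - po;
    d.push_back({da * da + dp * dp, i});
  }
  std::partial_sort(d.begin(), d.begin() + 3, d.end());
  const Eigen::Vector3d& p1 = map[d[0].second].p;
  const Eigen::Vector3d& p2 = map[d[1].second].p;
  const Eigen::Vector3d& p3 = map[d[2].second].p;

  // Eq. (10): (X - p1) . ((p1 - p2) x (p1 - p3)) = 0 with X = z * barX,
  // i.e. the ray through barX intersects the plane of the local patch.
  Eigen::Vector3d n = (p1 - p2).cross(p1 - p3);
  double denom = barX.dot(n);
  if (std::abs(denom) < 1e-9) return false;   // ray nearly parallel to the patch
  double z = p1.dot(n) / denom;               // recovered depth along the ray
  if (z <= 0) return false;                   // keep only points in front of the camera
  X = z * barX;
  return true;
}
```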
VI. BUNDLE ADJUSTMENT

The camera frame to frame motion estimated in the previous section is refined by a bundle adjustment, which takes a sequence of images and performs a batch optimization. As a trade-off between accuracy and processing time, we choose one image out of every five images as the bundle adjustment input. The image sequence contains a number of eight images (taken from 40 original images). This allows the batch optimization to finish before the next image sequence is accumulated and ready for processing. The bundle adjustment uses the open source library iSAM [30]. We choose iSAM over other libraries because it supports user-defined camera models, and can conveniently handle features both with and without available depth.

Here, we define another representation of the features, X̃_i^k = [x̄_i^k, ȳ_i^k, z_i^k]^T, where the x- and y- entries contain the normalized coordinates, and the z-entry contains the depth. For features without depth, z_i^k is set to a default value. Let J be the set of image frames in the sequence, and let l be the first frame in the set. Upon initialization, all features appearing in the sequence are projected into {C^l}, denoted as X̃_i^l, i ∈ I. Define T_l^j as the transform projecting X̃_i^l from {C^l} to {C^j}, where j is a different frame in the sequence, j ∈ J \ {l}. The bundle adjustment minimizes the following function by adjusting the motion transform between each two consecutive frames and the coordinates of X̃_i^l,

min Σ_{i,j} (T_l^j(X̃_i^l) − X̃_i^j)^T Ω_i^j (T_l^j(X̃_i^l) − X̃_i^j),   i ∈ I, j ∈ J \ {l}.   (11)

Here, X̃_i^j represents the observation of feature i at frame j, and Ω_i^j is its information matrix. The first two entries on the diagonal of Ω_i^j are given constant values. If the depth is from the depth map, the 3rd entry is set to a larger value; if the depth is from triangulation, the value is smaller and is inversely proportional to the square of the depth. A zero value is used for features with unknown depth.

The bundle adjustment publishes refined motion transforms at a low frequency. With the camera frame rate between 10-40Hz, the bundle adjustment runs at 0.25-1.0Hz. As illustrated in Fig. 5, a transform integration step takes the bundle adjustment output and combines it with the high frequency frame to frame motion estimates. The result is integrated motion transforms published at the same high frequency as the frame to frame motion transforms.

Fig. 5. Illustration of transform integration. The curves represent transforms. The green colored segment is refined by the bundle adjustment and becomes the blue colored segment, published at a low frequency. The orange colored segments represent frame to frame motion transforms, generated at a high frequency. The transform integration step takes the orange segment from the green segment and connects it to the blue segment. This results in integrated motion transforms published at the high frequency.

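The weighting rules for the information matrix Ω_i^j in (11) can be written down compactly; the sketch below is an editorial illustration of those rules, not the authors' code, and the numeric constants are assumptions since the paper only states their relative magnitudes.

```cpp
// Sketch of filling the per-feature information matrix Ω_i^j used in (11).
#include <Eigen/Dense>

enum class DepthSource { None, DepthMap, Triangulation };

Eigen::Matrix3d featureInformation(DepthSource source, double depth) {
  const double wImage    = 1.0;    // constant weight on the two normalized coordinates
  const double wDepthMap = 10.0;   // larger weight when depth comes from the depth map
  const double wTriang   = 1.0;    // base weight for triangulated depth

  Eigen::Matrix3d omega = Eigen::Matrix3d::Zero();
  omega(0, 0) = wImage;            // x̄ entry of X̃
  omega(1, 1) = wImage;            // ȳ entry of X̃
  switch (source) {
    case DepthSource::DepthMap:
      omega(2, 2) = wDepthMap;                     // depth from the map: trusted more
      break;
    case DepthSource::Triangulation:
      omega(2, 2) = wTriang / (depth * depth);     // inversely proportional to depth squared
      break;
    case DepthSource::None:
      omega(2, 2) = 0.0;                           // unknown depth: no depth constraint
      break;
  }
  return omega;
}
```

A feature's contribution to (11) is then the quadratic form of its reprojection difference weighted by this matrix, so features with unknown depth still constrain the motion through their normalized coordinates.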
VII. EXPERIMENTS

The visual odometry is tested with author-collected data and the KITTI benchmark datasets. It tracks Harris corners [26] by the Kanade Lucas Tomasi (KLT) method [31]. The program is implemented in the C++ language, on the robot operating system (ROS) [32] in Linux. The algorithms run on a laptop computer with 2.5GHz cores and 6GB memory, using around three cores for computation. The feature tracking and bundle adjustment (Section VI) take one core each, and the frame to frame motion estimation (Section V) and depth map registration together consume another core. Our software code and datasets are publicly available (wiki.ros.org/demo_rgbd and wiki.ros.org/demo_lidar), in two different versions based on the two sensors in Fig. 2.

A. Tests with Author-collected Datasets

We first conduct tests with author-collected datasets using the two sensors in Fig. 2. The data is collected from four types of environments shown in Fig. 6: a conference room, a large lobby, a cluttered road, and a flat lawn. The difficulty increases over the tests as the environments become more open and the depth information changes from dense to sparse. We present two images from each dataset, on the 2nd and 4th rows in Fig. 6. The red colored areas indicate coverage of the depth maps, from the RGB-D camera (right figure) and the lidar (left figure). Here, note that the depth map registers depth images from the RGB-D camera or point clouds from the lidar at multiple frames, and usually contains more information than that from a single frame. With the RGB-D camera, the average amount of imaged area covered by the depth map reduces from 94% to 23% over the tests. The lidar has a longer detection range and can provide more depth information in open environments; its depth coverage changes from 89% to 47% of the images.

The camera frame rate is set at 30Hz for both sensors. To evenly distribute the features within the images, we separate an image into 3 × 5 identical subregions. Each subregion provides maximally 30 features, giving maximally 450 features in total. The method is compared to two popular RGB-D visual odometry methods. Fovis estimates the motion of the camera by tracking image features, and depth is associated to the features from the depth images [20]. DVO is a dense tracking method that minimizes the photometric error within the overall images [23]. Both methods use data from the RGB-D camera. Our method is separated into two versions, using the two sensors in Fig. 2, respectively. The resulting trajectories are presented on the 1st and 3rd rows in Fig. 6, and the accuracy is compared in Table I, using errors in 3D coordinates. Here, the camera starts and stops at the same position, and the gap between the two ends of a trajectory compared to the length of the trajectory is considered the relative position error.

From these results, we conclude that all four methods function similarly when depth information is sufficient (in the room environment), while the relative error of DVO is slightly lower than the other methods. However, as the depth information becomes sparser, the performance of Fovis and DVO reduces significantly. During the last two tests, Fovis frequently pauses without giving odometry output due to an insufficient number of inlier features. DVO continuously generates output but drifts heavily. This is because both methods use only imaged areas where depth is available, leaving large areas in the visual images unused. On the other hand, the two versions of our method are able to maintain accuracy in the tests, except that the relative error of the RGB-D camera version is relatively large in the lawn environment, because the depth is too sparse during the turning at the top of Fig. 6(h).

TABLE I
RESULTS USING AUTHOR-COLLECTED DATA. THE ERROR IS MEASURED AT THE END OF A TRAJECTORY AS A % OF THE DISTANCE TRAVELED.

                                     Relative position error
Environment   Distance   Fovis     DVO       Our VO (RGB-D)   Our VO (Lidar)
Room          16m        2.72%     1.87%     2.14%            2.06%
Lobby         56m        5.56%     8.36%     1.84%            1.79%
Road          87m        13.04%    13.60%    1.53%            0.79%
Lawn          86m        9.97%     32.07%    3.72%            1.73%

Fig. 6. Comparison of four methods using author-collected datasets: Fovis, DVO, and two versions of our method using depth from an RGB-D camera and a lidar. The environments are selected respectively from a conference room ((a), (c), and (d)), a large lobby ((b), (e), and (f)), a cluttered road ((g), (i), and (j)), and a flat lawn ((h), (k), and (l)). We present two images from each dataset. The red colored areas indicate availability of depth maps, from the RGB-D camera (right figure) and the lidar (left figure). The depth information becomes sparser from each test to the next as the environment becomes more open, and the performance of Fovis and DVO reduces significantly. Our methods relatively maintain accuracy across the tests.

B. Tests with KITTI Datasets

The proposed method is further tested with the KITTI datasets. The datasets are logged with sensors mounted on the top of a passenger vehicle, in road driving scenarios. The vehicle is equipped with color stereo cameras, monochrome stereo cameras, a 360° Velodyne laser scanner, and a high accuracy GPS/INS for ground truth. Both image and laser data are logged at 10Hz. The image resolution is around 1230 × 370 pixels, with 81° horizontal field of view. Our method uses the imagery from the left monochrome camera and the laser data, and tracks maximally 2400 features from 3 × 10 identical subregions in the images.

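The even feature distribution described above (3 × 5 subregions with up to 30 features each for the author-collected data, 3 × 10 subregions for KITTI) can be sketched as a simple bucketing step. The Corner struct and the keep-strongest-by-response rule below are assumptions; the paper only states the per-region caps.

```cpp
// Sketch of bucketed feature selection over rows x cols identical image subregions.
#include <algorithm>
#include <vector>

struct Corner { float u, v, response; };   // pixel location and detector score

std::vector<Corner> bucketFeatures(std::vector<Corner> corners,
                                   int imageWidth, int imageHeight,
                                   int rows, int cols, int maxPerRegion) {
  std::vector<std::vector<Corner>> buckets(rows * cols);
  for (const Corner& c : corners) {
    int r   = std::max(0, std::min(rows - 1, (int)(c.v * rows / imageHeight)));
    int col = std::max(0, std::min(cols - 1, (int)(c.u * cols / imageWidth)));
    buckets[r * cols + col].push_back(c);
  }
  std::vector<Corner> kept;
  for (auto& b : buckets) {
    // Keep only the strongest corners in each subregion so features stay evenly spread.
    std::sort(b.begin(), b.end(),
              [](const Corner& a, const Corner& b2) { return a.response > b2.response; });
    if ((int)b.size() > maxPerRegion) b.resize(maxPerRegion);
    kept.insert(kept.end(), b.begin(), b.end());
  }
  return kept;
}
```

For example, bucketFeatures(corners, width, height, 3, 5, 30) caps the total at 450 features, matching the configuration used with the author-collected data.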
The datasets contain 11 tests with the GPS/INS ground truth provided. The data covers mainly three types of environments: "urban" with buildings around, "country" on small roads with vegetation in the scene, and "highway" where roads are wide and the surrounding environment is relatively clean. Fig. 7 presents sample results from the three environments. On the top row, the results of the proposed method are compared to the ground truth, with and without using the bundle adjustment introduced in Section VI. On the middle and bottom rows, an image and the corresponding laser point cloud are presented from each of the three datasets, respectively. The points are color coded by depth.

The complete test results with the 11 datasets are listed in Table II. The three tests from left to right in Fig. 7 are respectively datasets 0, 3, and 1 in the table. Here, the accuracy is measured by averaging relative position errors using segments of a trajectory at 100m, 200m, ..., 800m lengths, based on 3D coordinates. On the KITTI benchmark (www.cvlibs.net/datasets/kitti/eval_odometry.php), our accuracy is comparable to state of the art stereo visual odometry methods [33], [34], which retrieve depth from stereo imagery without aid from the laser data.

Fig. 7. Sample results of the proposed method using the KITTI datasets. The datasets are chosen from three types of environments: urban, country, and highway, from left to right. In (a)-(c), we compare results of the method with and without the bundle adjustment, to the GPS/INS ground truth. The black colored dots are the starting points. An image is shown from each dataset to illustrate the three environments, in (d)-(f), and the corresponding laser point cloud in (g)-(i). The points are color coded by depth, where red color indicates near objects and blue color indicates far objects.

TABLE II
CONFIGURATIONS AND RESULTS OF THE KITTI DATASETS. THE ERROR IS MEASURED USING SEGMENTS OF A TRAJECTORY AT 100M, 200M, ..., 800M LENGTHS, AS AN AVERAGED % OF THE SEGMENT LENGTHS.

Data no.   Distance   Environment       Mean relative position error
0          3714m      Urban             1.05%
1          4268m      Highway           1.87%
2          5075m      Urban + Country   0.93%
3          563m       Country           0.99%
4          397m       Country           1.23%
5          2223m      Urban             1.04%
6          1239m      Urban             0.96%
7          695m       Urban             1.16%
8          3225m      Urban + Country   1.24%
9          1717m      Urban + Country   1.17%
10         919m       Urban + Country   1.14%

Fig. 8. Comparison of relative position errors in urban, country, and highway environments, tested with the KITTI datasets. In each environment, we compare errors with and without the bundle adjustment. The black, blue, and red colored lines indicate 100%, 75%, and the median of the errors.

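The segment-based metric just described can be computed as in the simplified C++ sketch below, written in the spirit of the KITTI odometry evaluation: for every 100m to 800m segment, the relative motion of the estimate is compared against the ground truth and the end-point translation error is averaged as a fraction of the segment length. This is not the official KITTI evaluation code, and the function signature is an assumption.

```cpp
// Simplified sketch of the averaged segment-based relative position error.
#include <Eigen/Geometry>
#include <vector>

double segmentRelativeError(const std::vector<Eigen::Isometry3d>& gt,
                            const std::vector<Eigen::Isometry3d>& est) {
  // Assumes gt and est have the same length and are time-aligned.
  std::vector<double> dist(gt.size(), 0.0);   // cumulative driven distance (ground truth)
  for (size_t i = 1; i < gt.size(); ++i)
    dist[i] = dist[i - 1] + (gt[i].translation() - gt[i - 1].translation()).norm();

  double errSum = 0.0;
  int count = 0;
  for (double len = 100.0; len <= 800.0; len += 100.0) {
    for (size_t first = 0; first < gt.size(); ++first) {
      // Find the first frame that is 'len' meters ahead of 'first'.
      size_t last = first;
      while (last < gt.size() && dist[last] - dist[first] < len) ++last;
      if (last >= gt.size()) break;
      Eigen::Isometry3d dGt  = gt[first].inverse() * gt[last];
      Eigen::Isometry3d dEst = est[first].inverse() * est[last];
      Eigen::Vector3d  tErr  = (dEst.inverse() * dGt).translation();
      errSum += tErr.norm() / len;   // relative position error of this segment
      ++count;
    }
  }
  return count > 0 ? errSum / count : 0.0;   // averaged over all segments and lengths
}
```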
Further, to inspect the effect of the bundle adjustment, we compare the accuracy of the results in the three environments. The 11 datasets are manually separated into segments and labeled with an environment type. For each environment, the visual odometry is tested with and without the bundle adjustment. Fig. 8 shows the distributions of the relative errors. Overall, the bundle adjustment helps reduce the mean errors by 0.3%-0.7%, and seems to be more effective in urban and country scenes than on highways, partially because the feature quality is lower in the highway scenes.

VIII. CONCLUSION AND FUTURE WORK

The scenario of insufficient depth information is common for RGB-D cameras and lidars, which have limited ranges. Without sufficient depth, solving the visual odometry problem is hard. Our method handles the problem by exploiting both visual features whose depth is available and those whose depth is unknown. The depth is associated to the features in two ways, from a depth map and by triangulation using the previously estimated motion. Further, a bundle adjustment is implemented which refines the frame to frame motion estimates. The method is tested with author-collected data using two sensors and with the KITTI benchmark datasets. The results are compared to popular visual odometry methods, and the accuracy is comparable to state of the art stereo methods.

Considering future work, the current method uses Harris corners tracked by the KLT method. We experience difficulty in reliably tracking features in some indoor environments, such as a homogeneously colored corridor. Improvement of feature detection and tracking is needed. Further, the method is currently tested with depth information from RGB-D cameras and lidars. In the future, we will try to utilize depth provided by stereo cameras, and possibly extend the scope of our method to stereo visual odometry.

REFERENCES

[1] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. of the International Symposium on Mixed and Augmented Reality, Nara, Japan, Nov. 2007, pp. 1–10.
[2] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," in IEEE International Conference on Computer Vision, 2011, pp. 2320–2327.
[3] J. Engel, J. Sturm, and D. Cremers, "Semi-dense visual odometry for a monocular camera," in IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, Dec. 2013.
[4] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in IEEE International Conference on Robotics and Automation (ICRA), May 2014.
[5] D. Scaramuzza, "1-point-RANSAC structure from motion for vehicle-mounted cameras by exploiting non-holonomic constraints," International Journal of Computer Vision, vol. 95, pp. 74–85, 2011.
[6] S. Weiss, M. Achtelik, S. Lynen, M. Achtelik, L. Kneip, M. Chli, and R. Siegwart, "Monocular vision for long-term micro aerial vehicle state estimation: A compendium," Journal of Field Robotics, vol. 30, no. 5, pp. 803–831, 2013.
[7] P. Corke, D. Strelow, and S. Singh, "Omnidirectional visual odometry for a planetary rover," in Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, Japan, Sept. 2004, pp. 149–171.
[8] K. Konolige, M. Agrawal, and J. Solà, "Large-scale visual odometry for rough terrain," Robotics Research, vol. 66, pp. 201–212, 2011.
[9] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
[10] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research, vol. 32, pp. 1229–1235, 2013.
[11] D. Nister, O. Naroditsky, and J. Bergen, "Visual odometry for ground vehicle applications," Journal of Field Robotics, vol. 23, no. 1, pp. 3–20, 2006.
[12] M. Maimone, Y. Cheng, and L. Matthies, "Two years of visual odometry on the Mars Exploration Rovers," Journal of Field Robotics, vol. 24, no. 2, pp. 169–186, 2007.
[13] A. Howard, "Real-time stereo visual odometry for autonomous ground vehicles," in IEEE International Conference on Intelligent Robots and Systems, Nice, France, Sept. 2008.
[14] A. Geiger, J. Ziegler, and C. Stiller, "StereoScan: Dense 3D reconstruction in real-time," in IEEE Intelligent Vehicles Symposium, Baden-Baden, Germany, June 2011.
[15] L. Paz, P. Pinies, and J. Tardos, "Large-scale 6-DOF SLAM with stereo-in-hand," IEEE Transactions on Robotics, vol. 24, no. 5, pp. 946–957, 2008.
[16] R. Newcombe, A. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in IEEE International Symposium on Mixed and Augmented Reality, 2011, pp. 127–136.
[17] N. Engelhard, F. Endres, J. Hess, J. Sturm, and W. Burgard, "Real-time 3D visual SLAM with a hand-held RGB-D camera," in RGB-D Workshop on 3D Perception in Robotics at the European Robotics Forum, 2011.
[18] T. Whelan, H. Johannsson, M. Kaess, J. Leonard, and J. McDonald, "Robust real-time visual odometry for dense RGB-D mapping," in IEEE International Conference on Robotics and Automation, 2013.
[19] I. Dryanovski, R. Valenti, and J. Xiao, "Fast visual odometry and mapping from RGB-D data," in IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, 2013.
[20] A. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and N. Roy, "Visual odometry and mapping for autonomous flight using an RGB-D camera," in Int. Symposium on Robotics Research (ISRR), 2011.
[21] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments," The International Journal of Robotics Research, vol. 31, no. 5, pp. 647–663, 2012.
[22] S. Rusinkiewicz and M. Levoy, "Efficient variants of the ICP algorithm," in Third International Conference on 3D Digital Imaging and Modeling (3DIM), Quebec City, Canada, June 2001.
[23] C. Kerl, J. Sturm, and D. Cremers, "Robust odometry estimation for RGB-D cameras," in IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 2013.
[24] J. Sturm, E. Bylow, C. Kerl, F. Kahl, and D. Cremers, "Dense tracking and mapping with a quadrocopter," in Unmanned Aerial Vehicle in Geomatics (UAV-g), Rostock, Germany, 2013.
[25] G. Hu, S. Huang, L. Zhao, A. Alempijevic, and G. Dissanayake, "A robust RGB-D SLAM algorithm," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, Portugal, Oct. 2012.
[26] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. New York: Cambridge University Press, 2004.
[27] R. Murray and S. Sastry, A Mathematical Introduction to Robotic Manipulation. CRC Press, 1994.
[28] R. Andersen, "Modern methods for robust regression," Sage University Paper Series on Quantitative Applications in the Social Sciences, 2008.
[29] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications, 3rd Edition. Springer, 2008.
[30] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. Leonard, and F. Dellaert, "iSAM2: Incremental smoothing and mapping using the Bayes tree," International Journal of Robotics Research, vol. 31, pp. 217–236, 2012.
[31] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the Imaging Understanding Workshop, 1981, pp. 121–130.
[32] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R. Wheeler, and A. Ng, "ROS: An open-source robot operating system," in Workshop on Open Source Software (Collocated with ICRA 2009), Kobe, Japan, May 2009.
[33] H. Badino, A. Yamamoto, and T. Kanade, "Visual odometry by multi-frame feature integration," in Workshop on Computer Vision for Autonomous Driving (Collocated with ICCV 2013), Sydney, Australia, 2013.
[34] H. Badino and T. Kanade, "A head-wearable short-baseline stereo system for the simultaneous estimation of structure and motion," in IAPR Conference on Machine Vision Applications, Nara, Japan, 2011.
