Real-Time 6-DOF Monocular Visual SLAM in a Large-Scale Environment
Hyon Lim1, Jongwoo Lim2, and H. Jin Kim1
Abstract: A real-time approach for monocular visual simultaneous localization and mapping (SLAM) in a large-scale environment is proposed. From a monocular video sequence, the proposed method continuously computes the current 6-DOF camera pose and 3D landmark positions. The proposed method successfully builds consistent maps from challenging outdoor sequences using a monocular camera as the only sensor, whereas existing approaches have relied on additional structural information such as the camera height from the ground. By using a binary descriptor and metric-topological mapping, the system demonstrates real-time performance in large-scale outdoor environments without utilizing GPUs or reducing the input image size. The effectiveness of the proposed method is demonstrated on various challenging video sequences, including the KITTI dataset and indoor video captured on a micro aerial vehicle.
I. INTRODUCTION
Simultaneous Localization and Mapping (SLAM) is the problem of building a consistent map of an unknown environment while keeping track of the current location within it at the same time. This problem has been studied over the past three decades in robotics and more recently in computer vision. As cameras become ubiquitous in many robot systems, visual SLAM research has drawn increasing interest. As a result, the visual SLAM problem has been acknowledged as a key challenge for enabling fully autonomous robots without resorting to external positioning systems such as GPS or heavy and expensive laser range finders.
Visual SLAM methods are classified into two main categories by the number of cameras employed: monocular and stereo. Monocular systems have several advantages over stereo systems in terms of cost, flexibility, and computational efficiency. A single camera always costs less than a stereo camera system, and also provides flexibility in installing the camera on a robot. For example, a stereo camera should have a baseline of more than half a meter to obtain enough disparity when it is operated on a car for outdoor navigation [1]. However, robots such as micro aerial vehicles (MAVs) may not have the space for a wide-baseline stereo camera at all.
Despite these advantages, the nature of a monocular camera, which only provides bearing angles of visual features, has made monocular visual SLAM difficult [2], [3]. Therefore, a solid temporal feature association method is critical for monocular visual SLAM systems to achieve performance and stability comparable to stereo systems.
In the robotics literature, visual odometry (VO) has also been studied intensively. The main difference between
1 School of Mechanical and Aerospace Engineering, Seoul National
University, Korea. {hyonlim,hjinkim}@snu.ac.kr
2 Corresponding Author, Division of Computer Science and Engineering,
Hanyang University, Korea. [email protected]
with respect to it. Some researchers have introduced robo-centric (ego-centric) coordinates [18]. The origin of this system is the current coordinate frame of the robot, and other poses and landmarks are defined in the robot's coordinate system. The disadvantage of these representations is that not all measurements are available between the current pose and all landmarks. Therefore, some of the landmarks will be described by an integration of error-corrupted poses. This error is difficult to minimize, or converges very slowly, which is impractical in many cases. Our representation resembles the robo-centric coordinate system. However, keyframes are described by relative transformations, which are quite accurate compared to other estimates. A landmark is defined with respect to its anchor frame, which is the frame that observes the landmark for the first time. Mei et al. [19] proposed the Continuous Relative Representation (CRR). The idea is similar to ours, but there is no specific preference of frame origin for describing keyframes. We describe the pose graph only with relative transformations as edges between pose vertices.
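As an illustration of this relative representation (a minimal sketch of our own, not the authors' code; the chain-like edge ordering and the choice of the first keyframe as the fixed origin are assumptions), global keyframe poses can be recovered on demand by chaining the relative transformations, and a landmark enters the world frame only through the pose of its anchor keyframe:

```python
import numpy as np

def compose_global_poses(relative_edges):
    """Chain relative SE(3) transforms (4x4) into global keyframe poses.

    relative_edges[k] maps keyframe k+1's frame into keyframe k's frame;
    keyframe 0 is fixed at the identity.
    """
    poses = [np.eye(4)]                   # P_0 = I, anchor of the chain
    for T_rel in relative_edges:
        poses.append(poses[-1] @ T_rel)   # P_{k+1} = P_k * T_{k,k+1}
    return poses

def landmark_to_world(poses, anchor_idx, p_anchor):
    """Map a landmark stored in its anchor keyframe into the world frame."""
    p_h = np.append(p_anchor, 1.0)        # homogeneous 3D point
    return (poses[anchor_idx] @ p_h)[:3]
```

Because only the relative edges are stored, correcting one edge (e.g., after a loop closure) shifts every pose downstream of it without rewriting the landmarks themselves.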
III. MONOCULAR SLAM PIPELINES
Robust feature tracking is an important stage of monocular visual SLAM. To compute the 3D position of a feature, it should be tracked until it has enough disparity to be triangulated. In this section, we describe the steps performed for each input image.
A. Pipeline overview
Keypoints are extracted by the FAST detector [20]. The keypoints are tracked in the following frames by matching them to candidate keypoints within their local search neighborhood. The BRIEF [16] binary feature descriptor, which is very light to compute, is used to match the keypoints. This fast tracker is tightly integrated with the scene recognition module, which uses the bag of binary visual words approach [17] to find nearby keyframes and the matching 3D landmarks in the map. Thanks to the efficiency of binary descriptor matching, scene recognition can be done at frame rate.
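A minimal sketch of this style of tracker is given below (our illustration, not the paper's implementation): FAST keypoints are described with BRIEF and matched frame to frame by Hamming distance inside a local search window. The 30-pixel search radius and 50-bit distance threshold are assumed values, and BRIEF here comes from the opencv-contrib package.

```python
import cv2
import numpy as np

fast = cv2.FastFeatureDetector_create(threshold=20)
# BRIEF lives in opencv-contrib; ORB's descriptor would be a close stand-in.
brief = cv2.xfeatures2d.BriefDescriptorExtractor_create()

def detect_and_describe(gray):
    kps = fast.detect(gray, None)
    kps, descs = brief.compute(gray, kps)
    return kps, descs

def track(prev_kps, prev_descs, kps, descs, radius=30, max_hamming=50):
    """Match each previous keypoint to the closest-descriptor keypoint
    within a local search window of the new frame."""
    matches = []
    if len(kps) == 0:
        return matches
    pts = np.array([k.pt for k in kps])
    for i, (kp, d) in enumerate(zip(prev_kps, prev_descs)):
        near = np.where(np.linalg.norm(pts - kp.pt, axis=1) < radius)[0]
        if near.size == 0:
            continue
        dists = [cv2.norm(d, descs[j], cv2.NORM_HAMMING) for j in near]
        j = near[int(np.argmin(dists))]
        if min(dists) < max_hamming:
            matches.append((i, int(j)))   # (index in prev frame, index in current)
    return matches
```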
Typically, loop closing (i.e., scene recognition) has been performed using separate full-featured descriptors such as SIFT [21] or SURF [22], which are expensive to compute in real time. The main reason is that those descriptors give a considerably higher matching rate than other existing descriptors. However, those descriptors are not used in temporal feature association for pose estimation due to their high computational cost. In other words, these expensive descriptors have to be extracted and managed separately from the features used by the keypoint tracker (e.g., the image patch used with the Kanade-Lucas-Tomasi tracker [23]) in order to query the database for keyframes matching the current keyframe.
In this paper, we have developed a unified keypoint tracking and scene recognition system by using a homogeneous type of descriptor: the BRIEF descriptor. A new keypoint tracker is developed based on a binary descriptor to exploit the efficiency of binary descriptor matching for scene recognition. As a result, we are able to perform full visual SLAM tasks in real time.
[Figure: bag of binary visual words. Each vocabulary word stores an inverse index of images and their word weights v_i, and a direct index of the descriptors d that fall under the word at vocabulary tree level l.]
[Figure: relative-representation pose graph. Keyframe nodes are connected by relative-transformation edges P (with P11 = I4x4); each landmark node is attached to its anchor keyframe and observed through measurements z.]
C. Pose estimation
In our implementation, two types of pose estimation exist. One is for tracking of map landmarks, and the other is for keyframe initialization when 3D points are not yet available. The algorithm details are described below.
Keyframe initialization. When the system is initialized from scratch, there are no 3D landmarks available, since we consider a monocular camera. We only have 2D-2D keypoint matches when the first keyframe is added. In this case, we apply the 5-point algorithm [24] with RANSAC [25] to two selected frames to estimate the relative pose. The second view is selected when the number of features tracked from the first view drops below a threshold.
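A sketch of such a two-view initialization with OpenCV is given below (an illustration under assumed settings, not the authors' code): the essential matrix is estimated by the 5-point algorithm inside RANSAC, the relative pose is recovered, and initial landmarks are triangulated. The 1-pixel RANSAC threshold and 0.999 confidence are assumed values, and the resulting monocular map is defined only up to scale.

```python
import cv2
import numpy as np

def initialize_from_two_views(pts1, pts2, K):
    """Relative pose and initial landmarks from 2D-2D matches.

    pts1, pts2: Nx2 float arrays of matched pixel coordinates; K: 3x3 intrinsics.
    """
    E, inl = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                  prob=0.999, threshold=1.0)
    _, R, t, pose_inl = cv2.recoverPose(E, pts1, pts2, K, mask=inl)

    # Triangulate the inlier matches; the translation has arbitrary scale.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    good = pose_inl.ravel() > 0
    X_h = cv2.triangulatePoints(P1, P2, pts1[good].T, pts2[good].T)
    landmarks = (X_h[:3] / X_h[3]).T      # Nx3 points in the first camera frame
    return R, t, landmarks
```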
Environment map tracking. Each 3D landmark has an associated BRIEF descriptor, so it can be matched with 2D points for pose estimation. A 3D point is projected into the image based on the prior pose, and then we search for 2D-3D correspondences using the keypoint tracking idea of Section III-B. When 2D-3D correspondences are available, 3-point pose estimation [26] with RANSAC gives the relative pose (R, t). The computed pose is further optimized with the Levenberg-Marquardt algorithm by minimizing the reprojection error.
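The sketch below illustrates this 2D-3D stage with OpenCV (our illustration with assumed thresholds, not the authors' implementation): P3P inside RANSAC followed by Levenberg-Marquardt refinement of the reprojection error over the inliers. cv2.solvePnPRefineLM requires OpenCV 4.1 or later.

```python
import cv2
import numpy as np

def track_map(points3d, points2d, K):
    """Pose from 2D-3D correspondences (Nx3 and Nx2 float arrays).

    The 2-pixel reprojection threshold is an assumed value.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points3d, points2d, K, None,
        reprojectionError=2.0, flags=cv2.SOLVEPNP_P3P)
    if not ok or inliers is None:
        return None
    idx = inliers.ravel()
    # LM refinement minimizes the summed squared reprojection error on inliers.
    rvec, tvec = cv2.solvePnPRefineLM(points3d[idx], points2d[idx], K, None,
                                      rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```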
Pose estimation with loop closure. At every keyframe addition, the tracked BRIEF descriptors are provided to the vocabulary tree. When we pass the descriptors to the loop closure detector, it seeks the best matching keyframe. We associate each 3D point with the keypoint that is passed to the vocabulary tree in order to utilize the direct index. This is explained in the following section.
D. Loop closing
Scene recognition. Our feature tracker provides the most recent BRIEF descriptor tracked up to the current frame. These descriptors are provided to the vocabulary tree as a query to retrieve the keyframes most similar to the current frame.
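To make the retrieval step concrete, here is a toy bag-of-binary-words index in the spirit of [17] (a simplified sketch, not the paper's exact scoring; the quantize callback standing in for the vocabulary tree and the plain word-overlap score are assumptions):

```python
from collections import defaultdict

class BinaryBoWIndex:
    """Toy inverted index over quantized binary descriptors.

    `quantize` maps a BRIEF descriptor to a word id, standing in for the
    vocabulary tree of [17]; the scoring is a simplified tf-style overlap.
    """
    def __init__(self, quantize):
        self.quantize = quantize
        self.inverted = defaultdict(list)   # word id -> [(keyframe id, count)]
        self.norms = {}                     # keyframe id -> L1 norm of BoW vector

    def add_keyframe(self, kf_id, descriptors):
        counts = defaultdict(int)
        for d in descriptors:
            counts[self.quantize(d)] += 1
        self.norms[kf_id] = float(sum(counts.values()))
        for w, c in counts.items():
            self.inverted[w].append((kf_id, c))

    def query(self, descriptors):
        """Return candidate keyframes sorted by accumulated word overlap."""
        scores = defaultdict(float)
        for d in descriptors:
            for kf_id, c in self.inverted[self.quantize(d)]:
                scores[kf_id] += c / self.norms[kf_id]
        return sorted(scores.items(), key=lambda kv: -kv[1])
```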
[Figure: metric embedding and de-embedding between the topological map (keyframe nodes, landmark anchors, relative-transformation edges, measurements) and the metric map (fixed, reference, and active keyframes; active landmarks).]
Fig. 9: Loop closing. A large-scale loop closure detected in sequence #0 of the KITTI dataset. The loop closure is marked in the figure. (a) A frame before the loop closure detection. Long-distance travel without loop closing makes the current camera location drift. (b) Loop closure is performed by shifting the pose graph. This corrects the drift immediately, but the metric property of the map may not be valid. (c) After pose-graph optimization, the corrected map (in both pose error and scale) is obtained.
Fig. 10: Ground truth comparison. Trajectories estimated by the proposed method (blue dotted line) and ground truth (red line) on the KITTI dataset.
Fig. 11: Environment mapping results on the KITTI dataset. Landmarks (black dots) are drawn with the camera trajectory (cyan line). Our approach worked well on these sequences.
[Figure panels: Matching, Tracking, P3P, and Keyframe processing time (ms) per frame index.]
Fig. 12: Runtime performance. Per-frame processing time for sequence 0 of the KITTI dataset.
[18] B. Williams and I. Reid, "On combining visual SLAM and visual odometry," in IEEE International Conference on Robotics and Automation (ICRA), 2010, pp. 3494-3500.
[19] C. Mei, G. Sibley, M. Cummins, P. Newman, and I. Reid, "RSLAM: A system for large-scale mapping in constant-time using stereo," International Journal of Computer Vision, vol. 94, no. 2, pp. 198-214, 2011.
[20] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in ECCV. Springer, 2006.
[21] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, pp. 91-110, 2004.
[22] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in ECCV. Springer, 2006, pp. 404-417.
[23] B. D. Lucas, T. Kanade, et al., "An iterative image registration technique with an application to stereo vision," in IJCAI, vol. 81, 1981, pp. 674-679.
[24] D. Nistér, "An efficient solution to the five-point relative pose problem," PAMI, vol. 26, no. 6, pp. 756-770, 2004.
[25] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.
[26] L. Kneip, D. Scaramuzza, and R. Siegwart, "A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation," in CVPR. IEEE, 2011, pp. 2969-2976.
[27] J. Lim, J.-M. Frahm, and M. Pollefeys, "Online environment mapping using metric-topological maps," The International Journal of Robotics Research, vol. 31, no. 12, pp. 1394-1408, 2012.
[28] S. Agarwal, K. Mierle, and others, "Ceres solver," https://code.google.com/p/ceres-solver/.
[29] A. Geiger, J. Ziegler, and C. Stiller, "StereoScan: Dense 3D reconstruction in real-time," in IEEE Intelligent Vehicles Symposium (IV), 2011, pp. 963-968.