Keyframe-Based Real-Time Camera Tracking
Zilong Dong1    Guofeng Zhang1    Jiaya Jia2    Hujun Bao1
1 State Key Lab of CAD&CG, Zhejiang University
{zldong, zhangguofeng, bao}@cad.zju.edu.cn

Abstract
We present a novel keyframe selection and recognition method for robust markerless real-time camera tracking. Our system contains an offline module to select features from a group of reference images and an online module to match them to the input live video in order to quickly estimate the camera pose. The main contribution lies in constructing an optimal set of keyframes from the input reference images, which are required to approximately cover the entire space and at the same time minimize the content redundancy amongst the selected frames. This strategy not only greatly saves computation, but also helps significantly reduce the number of repeated features so as to improve the camera tracking quality. Our system also employs a parallel-computing scheme with a multi-CPU hardware architecture. Experimental results show that our method dramatically enhances the computation efficiency and eliminates the jittering artifacts.
1. Introduction
Vision-based camera tracking aims to estimate camera parameters, such as rotation and translation, from the input images (or videos). It is a foundation for solving a wide spectrum of computer vision problems, e.g., 3D reconstruction, video registration and enhancement. Offline camera tracking for uncalibrated image sequences can achieve accurate camera pose estimation [14, 23, 29] without requiring high efficiency. Recently, real-time markerless tracking [9, 28, 19, 18] has attracted much attention, as it finds many new applications in augmented reality, mobility, and robotics. This paper focuses on developing a practical real-time camera tracking system using the global localization (GL) scheme [28, 26], which involves an offline process for space abstraction using features and an online step for feature matching. Specifically, the offline step extracts sparse invariant features from the captured reference images and uses
them to represent the scene. The 3D locations of these invariant features can be estimated by offline structure-from-motion (SFM). Afterwards, taking these features as references, the successive online process matches them with the features extracted from the captured live video to establish correspondences and quickly estimate new camera poses. The GL scheme is robust to fast movement because it matches features in a global way. This also precludes the possibility of error accumulation. It, however, has the following common problems in prior work. First, it is difficult to achieve real-time performance due to expensive feature extraction and matching, even in a relatively small working space. Second, these methods rely excessively on feature distinctiveness, which cannot be guaranteed when the space scale becomes large or the scene contains repeated structures. It was observed that the matching reliability descends quickly when the number of features increases, which greatly affects the robustness and practicability of such a system in camera tracking. In this paper, we solve the above efficiency and reliability problems and develop a complete real-time tracking system using the GL scheme. Our contributions lie in the following three aspects. First, we propose an effective keyframe-based tracking method to increase its practicability in general camera tracking for large-scale scenes. A novel keyframe selection algorithm is proposed to effectively reduce the online matching ambiguity and redundancy. These keyframes are selected from all reference images to abstract the space with a few criteria: i) the keyframes should approximate the original reference images and contain as many salient features as possible; ii) the common features among these frames should be minimal, in order to reduce feature non-distinctiveness in matching; iii) the features should be distributed evenly in the keyframes such that, given any new input frame in the same environment, the system can always find sufficient feature correspondences and compute accurate camera parameters. Second, with the extracted keyframes, in the real-time camera tracking stage, we contribute an extremely efficient keyframe recognition algorithm, which is able to find appropriate matching keyframes almost instantly from hundreds of candidates. Therefore, the computation can be greatly reduced compared to conventional global feature matching. Finally, we develop a parallel-computing framework for further speed-up. Real-time performance is achieved with all these contributions in our keyframe-based camera tracking system.
2. Related Work
We review camera tracking methods using keyframes and feature-based location recognition in this section.
Real-time Camera Tracking  In the past few years, SLAM has been extensively studied [11, 4, 10, 16, 17] and used for real-time camera tracking. SLAM methods estimate the environment structure and the camera trajectory online, under a highly nonlinear partial observation model. They typically use frame-to-frame matching and confining-search-region strategies for rapid feature matching. As a result, they usually run fast. However, drifting and relocalisation problems can arise with this scheme because it relies heavily on the accuracy of past frames. Recent development has mainly concentrated on improving robustness and accuracy in larger-scale scenes. The major issues of this scheme include relocalisation after the camera is lost [32, 3, 17, 12], submap merging and switching [2], and close-loop management [31]. If a 3D representation of the space is available, real-time camera tracking can be easier and more robust. Several markerless algorithms [30, 6] have been proposed to employ an object's CAD model to facilitate camera pose estimation. However, these CAD models are usually difficult, if not impossible, to construct. Skrypnyk and Lowe [28] proposed modeling a natural scene using a set of sparse invariant features. The developed system contains two modules, i.e., offline feature-based scene modeling and online camera tracking. It runs at a low frame rate due to expensive SIFT feature extraction and matching. It also relies heavily on the distinctiveness of SIFT features, and is therefore limited to representing a relatively small space. This paper focuses on solving these problems in a two-stage tracking system.
Keyframe-based Methods  Keyframe selection is a common technique to reduce data redundancy. In the real-time camera tracking method of Klein et al. [16], a set of online keyframes is selected to facilitate the bundle adjustment for 3D map recovery. In [17], keyframes are used for relocalisation with simple image descriptors. For model-based tracking, Vacchetti et al. [30] select keyframes manually in the offline mode, and match each online frame to the keyframe with the closest visible area. In all these methods, keyframes are selected manually or by a simple procedure, which is not optimal for the tracking task when the camera undergoes complex motions. In our method, a set of optimal keyframes is selected for representing and abstracting a space. They are vital for efficient online feature matching.
Feature-based Location Recognition  There are approaches that employ invariant features for object and location recognition [27, 24, 25, 7, 8, 1]. These methods typically extract invariant features from each image, and use them to estimate the similarity of different images. To deal effectively with large-scale image databases, a vocabulary tree [22, 5] was adopted to organize and search millions of feature descriptors. However, these methods do not extract sparse keyframes to reduce the data redundancy, and cannot be directly used for real-time tracking. The appearance-based SLAM method of Cummins and Newman [8] considers the inter-dependency among features and applies the bag-of-words method within a probabilistic framework to increase the speed of location recognition. However, the computation cost is still high and it cannot achieve real-time tracking. Recently, Irschara et al. [15] proposed a fast location recognition technique based on SFM point clouds. To reduce the 3D database size and improve recognition performance, synthetic views are involved to create a compressed 3D scene representation. In contrast, we propose constructing an optimal set of keyframes from the input reference images by minimizing an energy function, which establishes a good balance between representation completeness and redundancy reduction.
3. Framework Overview
We first give an overview of our framework in Table 1. It contains two modules, i.e., the offline feature-based scene modeling and the online camera tracking. The offline module is responsible for processing the reference images and modeling the space with sparse 3D points. In this stage, SIFT features [21] are first detected from the reference images to establish the multi-view correspondences. Then we use the SFM method [33] to estimate the camera poses together with the 3D locations of the SIFT features.
1. Offline space abstraction:
   1.1 Extract SIFT features from the reference images, and recover their 3D positions by SFM.
   1.2 Select optimal keyframes and construct a vocabulary tree for online keyframe recognition.
2. Online real-time camera tracking:
   2.1 Extract SIFT features for each input frame of the incoming live video.
   2.2 Quickly select candidate keyframes.
   2.3 Feature matching with candidate keyframes.
   2.4 Estimate camera pose with the matched features.
Table 1. Framework Overview
In the online module, we estimate the camera parameters for each input frame in real time, given any captured live video in the same space. Instead of frame-by-frame matching using all reference images, our approach selects several optimal keyframes to represent the scene, and builds a vocabulary tree for online keyframe recognition. For each online frame, we select appropriate candidate keyframes by a fast keyframe recognition algorithm. Then the live frame only needs to be compared with the candidate keyframes for feature matching. With the estimated 3D positions of all reference features on the keyframes and sufficient 2D-3D correspondences for the live frame, the camera pose can be reliably estimated.
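To make the online steps in Table 1 concrete, the sketch below (Python with OpenCV) outlines one possible per-frame loop. The keyframe database interface (vote_candidates, match), the candidate count K = 4, and the minimum correspondence count are illustrative assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()

def track_frame(frame_gray, keyframe_db, camera_matrix, K_candidates=4):
    """One online tracking step: features -> keyframe recognition -> 2D-3D pose.
    keyframe_db is a placeholder object providing vocabulary-tree voting and
    per-keyframe matching, following the components of Table 1."""
    # 2.1 Extract SIFT features from the live frame.
    kps, descs = sift.detectAndCompute(frame_gray, None)

    # 2.2 Quickly select candidate keyframes via vocabulary-tree voting.
    candidates = keyframe_db.vote_candidates(descs, top_k=K_candidates)

    # 2.3 Match the live features only against the candidate keyframes.
    pts2d, pts3d = [], []
    for kf in candidates:
        for query_idx, ref_point_3d in keyframe_db.match(kf, kps, descs):
            pts2d.append(kps[query_idx].pt)
            pts3d.append(ref_point_3d)

    # 2.4 Estimate the camera pose from 2D-3D correspondences (RANSAC PnP).
    if len(pts3d) < 6:
        return None  # not enough correspondences for a reliable pose
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.float32(pts3d), np.float32(pts2d), camera_matrix, None)
    return (rvec, tvec) if ok else None
```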
set spanned by X. It is notable that |f(X)|, for all X, must be larger than 1, because if one feature finds no match in other images, its 3D position cannot be determined. We denote the features f(X) with |f(X)| ≥ l as superior features, and the set of the superior features as V(Ĩ). l is set to 10~20 in our experiments.
The keyframe selection energy combines a completeness term and a redundancy term, where w is a weight balancing the two. The definitions of the two terms are described as follows.
We define the saliency of one SIFT feature as the combination of two factors, i.e., the match count of one feature in different reference images |f(X)| and the Difference-of-Gaussian (DoG) strength, and write it as
    s(X) = D(X) · min(|f(X)|, T),    (2)
where T is a truncation threshold that prevents a long track from over-suppressing others. It is set to 30 in our experiments. A large value of |f(X)| indicates a high confidence of matching. D(X) is expressed as
    D(X) = (1 / |f(X)|) Σ_{x_i ∈ f(X)} D_i(x_i),
where D_i denotes the DoG map [21]. D(X) represents the average DoG strength over all features in f(X). The larger D(X) is, the higher the saliency of the feature set X.
Besides the above two measures, another important constraint for making real-time tracking reliable is a spatially near-uniform distribution of all features in the space. It is essential for finding sufficient matches with respect to input online frames in the same space. We define the feature density d(y_i) for each pixel y in image i. Its computation is described in Algorithm 1. With the density maps for all images, we define the set density as
    d(X) = (1 / |f(X)|) Σ_{x_i ∈ f(X)} d(x_i),
where d(x_i) denotes the feature density of x_i in image i.
Algorithm 1  Feature density computation for image i
1. Initialize all densities to zeros in the map.
2. for j = 1, ..., m,               % m is the feature number in image i
     for each pixel y_i ∈ W(x_j),   % W is a 31×31 window and x_j is the coordinate of feature j in image i
       d(y_i) += 1.
Finally, our completeness term is defined as
    E_t(F) = 1 − ( Σ_{X ∈ V(F)} s(X) / (γ + d(X)) ) / ( Σ_{X ∈ V(Ĩ)} s(X) / (γ + d(X)) ),    (3)
where γ controls the sensitivity to feature density. It is set to 3 in our experiments. V(F) denotes the superior feature set in the keyframe set F.
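As an illustration of Algorithm 1 and the saliency score of Eq. (2), the following Python sketch accumulates a per-pixel density map with a 31×31 window and evaluates s(X) for one feature track. The array layout and function names are assumptions; the window size and T = 30 follow the text.

```python
import numpy as np

def feature_density_map(height, width, feature_xy, win=31):
    """Algorithm 1: each feature adds 1 to every pixel inside a win x win window."""
    d = np.zeros((height, width), dtype=np.float32)
    r = win // 2
    for (x, y) in feature_xy:
        x, y = int(round(x)), int(round(y))
        y0, y1 = max(0, y - r), min(height, y + r + 1)
        x0, x1 = max(0, x - r), min(width, x + r + 1)
        d[y0:y1, x0:x1] += 1.0
    return d

def saliency(track_dog_values, T=30):
    """Eq. (2): s(X) = D(X) * min(|f(X)|, T), where D(X) is the mean DoG strength
    of the track's measurements over the reference images it appears in."""
    n = len(track_dog_values)          # |f(X)|, the match count of the track
    D = float(np.mean(track_dog_values))
    return D * min(n, T)
```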
[Figure 1 illustration: the vocabulary tree (left) and keyframe voting (right).]
Figure 1. A vocabulary tree. All feature descriptors are originally in the root node, and are partitioned hierarchically. Each node has a weight to represent its distinctiveness.
where 1/|V(Ĩ)| is for normalization, and |f(X) ∩ F| computes the copies of features in both X and the keyframes; |f(X) ∩ F| = 1 indicates no redundancy.
where K is the number of all keyframes. The node count |V| is determined by the branching factor b and the tree depth l. In our experiments, we normally select 20~50 keyframes, and about 300~500 features are extracted from each keyframe. Therefore, we usually set b = 8 and l = 5 in our experiments.
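Assuming a complete b-ary tree with levels 0 to l (the completeness of the tree is our assumption, not stated in the text), the node count implied by these settings can be checked with a few lines of Python:

```python
def node_count(b, l):
    # Complete b-ary tree with levels 0..l (root at level 0): sum of b^i.
    return sum(b ** i for i in range(l + 1))

print(node_count(8, 5))  # 37449 nodes for b = 8, l = 5
```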
Algorithm 3  Candidate keyframe selection
1. Set the matching value C(k) = 0 for each keyframe k.
2. For each online frame, the detected m features are matched from the root node to the leaves in the vocabulary tree as follows: in each level, for each closest node i with weight w_i > ε (a small threshold), for each k ∈ L_i, C(k) += N_i(k) · w_i.
3. Select the K keyframes with the largest C.
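A possible Python rendering of Algorithm 3 is sketched below. The tree interface (closest_path, per-node weight, inverted keyframe list L_i, and counts N_i(k)) is assumed for illustration; only the voting rule itself comes from the algorithm above.

```python
from collections import defaultdict

def vote_candidates(frame_descriptors, tree, top_k=4, weight_threshold=0.0):
    """Sketch of Algorithm 3: accumulate a matching value C(k) per keyframe by
    descending the vocabulary tree for every live-frame descriptor."""
    C = defaultdict(float)                       # matching value per keyframe
    for desc in frame_descriptors:
        for node in tree.closest_path(desc):     # one closest node per level
            if node.weight <= weight_threshold:  # skip non-distinctive nodes
                continue
            for k in node.keyframe_list:         # L_i: keyframes seen at node i
                C[k] += node.count(k) * node.weight  # C(k) += N_i(k) * w_i
    # Keep the top_k keyframes with the largest accumulated matching value.
    return sorted(C, key=C.get, reverse=True)[:top_k]
```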
[Table: per-module statistics for SIFT feature extraction, keyframe recognition, keyframe-based matching, camera pose estimation, and rendering.]
Because an offline KD-tree is independently constructed for each keyframe, matching can be sped up. Outliers are rejected in our system by enforcing epipolar geometry between the online frame and the keyframes using RANSAC [13]. To obtain more matches, we can utilize the matched features on the previous frame by finding their correspondences on the current frame through local spatial searching [16]. Once all matches are found, since each feature in the reference keyframes corresponds to a 3D point in the space, we use these 2D-3D correspondences to estimate the camera pose [28].
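The per-keyframe matching and outlier rejection described above might look like the following OpenCV sketch, assuming each keyframe carries a prebuilt FLANN-based matcher over its SIFT descriptors; the ratio-test value and the RANSAC threshold are common defaults, not values given in the paper.

```python
import cv2
import numpy as np

def match_to_keyframe(frame_kps, frame_descs, kf_matcher, kf_kps, ratio=0.8):
    """Match live-frame SIFT descriptors against one keyframe's KD-tree
    (kf_matcher: a cv2.FlannBasedMatcher with the keyframe's descriptors added
    and trained), then reject outliers with epipolar geometry via RANSAC."""
    knn = kf_matcher.knnMatch(frame_descs, k=2)       # 2-NN search in the KD-tree
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if len(good) < 8:
        return []                                     # too few for epipolar geometry
    pts_live = np.float32([frame_kps[m.queryIdx].pt for m in good])
    pts_kf = np.float32([kf_kps[m.trainIdx].pt for m in good])
    F, mask = cv2.findFundamentalMat(pts_live, pts_kf, cv2.FM_RANSAC, 3.0, 0.99)
    if mask is None:
        return []
    return [m for m, keep in zip(good, mask.ravel()) if keep]
```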
[Table 3: statistics of feature completeness and redundancy with different weights and keyframe numbers (33, 28, 23, 13, 8).]
Figure 2. Diagram of our system. It is divided into multiple parts connected by thread-safe buffers. Different components run on separate working threads, and are synchronized by the frame time stamp.
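The thread-and-buffer structure of Figure 2 can be sketched with standard thread-safe queues, as below; the stage granularity mirrors Table 1, while the queue capacity and the use of Python's threading module are illustrative assumptions.

```python
import queue
import threading

def stage(work, inbox, outbox):
    """Generic pipeline stage: take a (timestamp, data) item from a thread-safe
    buffer, process it, and pass the result downstream in timestamp order."""
    while True:
        stamp, data = inbox.get()        # blocks until the upstream stage produces
        outbox.put((stamp, work(data)))  # the frame time stamp keeps stages synchronized

def build_pipeline(extract_sift, recognize, match, estimate_pose):
    """Connect the online components with bounded buffers so that each one runs
    on its own working thread (cf. Figure 2)."""
    works = [extract_sift, recognize, match, estimate_pose]
    buffers = [queue.Queue(maxsize=4) for _ in range(len(works) + 1)]
    for i, work in enumerate(works):
        threading.Thread(target=stage, args=(work, buffers[i], buffers[i + 1]),
                         daemon=True).start()
    return buffers[0], buffers[-1]  # push (timestamp, frame) in; read poses out
```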
[Figure 4 plot: recognition time (ms) against the number of keyframes (50-300).]
Figure 4. Time spent in keyframe recognition. The computation of the appearance-vector-based method [22] grows rapidly with the keyframe number while our method does not. The total running time of our method is short.
[Figure 5 plot: number of matches per frame for global matching vs. keyframe-based matching.]
Figure 5. Comparison of feature matching. Our keyframe-based method yields much more reliable matches than global matching.
To demonstrate the effectiveness of our method, we also compare our keyframe-based matching with the method of [28]. In [28], a KD-tree is constructed for all reference features, and each feature in a live frame is compared with those in the KD-tree. We name this scheme global matching to distinguish it from our keyframe-based matching. Figure 5 compares the real-time matching quality of these two methods in the indoor cubicle example, measured by the number of correct matches in processing one online frame. As illustrated, our keyframe method yields much more reliable matches than the global method. With the KD-tree, the matching time of each feature is O(log M) for global matching, where M is the total number of features. For our keyframe-based matching, the running time is only K · O(log m), where m is the average number of features in each keyframe and K is the number of candidate keyframes. For global matching, the computation time grows with M. For our keyframe-based method, the computation time is relatively stable and does not grow with the total number of features. In our experiments, for each online frame, we extract about 300 SIFT features, and the global matching time is about 40 ms (M = 6167) with a single working thread. Our method only uses 15 ms with K = 4. The augmented result obtained by inserting a virtual object is shown in our supplementary video. The jittering artifact when inserting an object into the live video is noticeable with global matching, but not with our keyframe-based matching. Figure 6 shows another example of an outdoor scene containing many repeated, similar structures. The scale of the space is relatively large. Figure 6(a) shows the recovered 61522 3D points. The selected 39 keyframes are shown in Figure 6(b), covering 53.7% of the superior features. Please refer to our supplementary video for the augmented result and more examples. Klein and Murray [16, 17] employed online bundle adjustment (BA) with parallel computing to avoid offline 3D reconstruction. This strategy, however, is not very suitable for our examples, which are taken in relatively large scenes, because SFM for such a workspace requires computationally expensive global BA which cannot be done online. We have tested our sequences using the publicly available PTAM code (http://www.robots.ox.ac.uk/gk/PTAM/). The input frame rate is set to 5 fps to give enough time to the local BA thread, and each sequence is repeated to allow the global BA to converge. Even so, we found that PTAM only succeeded in tracking the first half of the indoor cubicle sequence, and failed on the other two outdoor sequences. Readers are referred to our supplementary video for the detailed comparison.
Figure 6. (a) The recovered 3D feature points. (b) The keyframes viewed in 3D.
7. Conclusions
In this paper, we have presented an effective keyframe-based real-time camera tracking system. In the offline stage, the keyframes are selected from the captured reference images based on a few criteria. For quick online matching, we introduce a fast keyframe candidate searching algorithm to avoid exhaustive frame-by-frame matching. Our experiments show that a small number of candidate reference images are sufficient for achieving high coverage of the features in the input images. Compared to global matching, our method not only simplifies feature matching and speeds it up, but also minimizes the matching ambiguity when the original images contain many non-distinctive features. This makes camera pose estimation robust. Our method still has limitations. If the camera moves to a place significantly different from the training keyframes, the camera pose cannot be accurately estimated. In practice, this problem can be alleviated by capturing sufficient reference images to cover the space. In addition, this paper has demonstrated the effectiveness of the proposed keyframe selection and recognition methods. We believe they could be combined with other frame-to-frame tracking schemes, such as SLAM, to further improve efficiency. So one possible direction of our future work is to reduce offline processing by collecting online keyframes to update and extend the space structure, and to combine frame-to-frame tracking (such as the SLAM methods) to further improve the performance of our system in an even larger space. Employing the GPU for further acceleration will also be explored.
Acknowledgements
This work is supported by the 973 program of China (No. 2009CB320804), NSF of China (No. 60633070), the 863 program of China (No. 2007AA01Z326), and the Research Grants Council of the Hong Kong Special Administrative Region, under General Research Fund (Project No. 412708).
References
[1] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. A fast and incremental method for loop-closure detection using bags of visual words. IEEE Transactions on Robotics, Special Issue on Visual SLAM, October 2008.
[2] R. O. Castle, G. Klein, and D. W. Murray. Video-rate localization in multiple maps for wearable augmented reality. In Proc. 12th IEEE Int. Symp. on Wearable Computers, Pittsburgh, PA, 2008.
[3] D. Chekhlov, W. Mayol-Cuevas, and A. Calway. Appearance based indexing for relocalisation in real-time visual SLAM. In 19th British Machine Vision Conference, pages 363-372. BMVA, September 2008.
[4] D. Chekhlov, M. Pupilli, W. Mayol, and A. Calway. Robust real-time visual SLAM using scale prediction and exemplar based feature description. In CVPR, pages 1-7, 2007.
[5] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, pages 1-8, 2007.
[6] A. I. Comport, E. Marchand, M. Pressigout, and F. Chaumette. Real-time markerless tracking for augmented reality: The virtual visual servoing framework. IEEE Transactions on Visualization and Computer Graphics, 12(4):615-628, July-August 2006.
[7] M. Cummins and P. Newman. FAB-MAP: Probabilistic localization and mapping in the space of appearance. Int. J. Rob. Res., 27(6):647-665, 2008.
[8] M. J. Cummins and P. Newman. Accelerated appearance-only SLAM. In ICRA, pages 1828-1833, 2008.
[9] A. J. Davison. Real-time simultaneous localisation and mapping with a single camera. In ICCV, pages 1403-1410, 2003.
[10] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):1052-1067, 2007.
[11] E. Eade and T. Drummond. Scalable monocular SLAM. In CVPR (1), pages 469-476, 2006.
[12] E. Eade and T. Drummond. Unified loop closing and recovery for real time monocular SLAM. In British Machine Vision Conference (BMVC), 2008.
[13] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381-395, 1981.
[14] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
[15] A. Irschara, C. Zach, J.-M. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In CVPR, 2009.
[16] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In ISMAR 2007, pages 225-234, Nov. 2007.
[17] G. Klein and D. Murray. Improving the agility of keyframe-based SLAM. In ECCV, volume 2, pages 802-815, 2008.
[18] T. Lee and T. Höllerer. Hybrid feature tracking and user interaction for markerless augmented reality. In VR, pages 145-152, 2008.
[19] V. Lepetit and P. Fua. Monocular model-based 3D tracking of rigid objects. Found. Trends. Comput. Graph. Vis., 1(1):1-89, 2005.
[20] T. Liu and J. R. Kender. Optimization algorithms for the selection of key frame sequences of variable length. In ECCV (4), pages 403-417, 2002.
[21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[22] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161-2168, Washington, DC, USA, 2006. IEEE Computer Society.
[23] M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch. Visual modeling with a hand-held camera. International Journal of Computer Vision, 59(3):207-232, 2004.
[24] F. Schaffalitzky and A. Zisserman. Automated location matching in movies. Computer Vision and Image Understanding, 92(2-3):236-264, 2003.
[25] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In CVPR, 2007.
[26] S. Se, D. Lowe, and J. Little. Vision-based global localization and mapping for mobile robots. IEEE Transactions on Robotics, 21:364-375, 2005.
[27] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, page 1470, Washington, DC, USA, 2003. IEEE Computer Society.
[28] I. Skrypnyk and D. G. Lowe. Scene modelling, recognition and tracking with invariant image features. In ISMAR '04: Proceedings of the 3rd IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 110-119, Washington, DC, USA, 2004. IEEE Computer Society.
[29] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from Internet photo collections. International Journal of Computer Vision, 80(2):189-210, November 2008.
[30] L. Vacchetti, V. Lepetit, and P. Fua. Combining edge and texture information for real-time accurate 3D camera tracking. In Third IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 48-57, Arlington, Virginia, Nov 2004.
[31] B. Williams, J. Cummins, M. Neira, P. Newman, I. Reid, and J. Tardos. An image-to-map loop closing method for monocular SLAM. In Proc. International Conference on Intelligent Robots and Systems, 2008.
[32] B. Williams, G. Klein, and I. Reid. Real-time SLAM relocalisation. In ICCV, pages 1-8, 2007.
[33] G. Zhang, X. Qin, W. Hua, T.-T. Wong, P.-A. Heng, and H. Bao. Robust metric reconstruction from challenging video sequences. In CVPR, 2007.