Robust Range Camera Pose Estimation for Mobile Online Scene Reconstruction
Abstract— We present a multi-sensor camera tracking method for real-time 3-D reconstruction on mobile devices. Our approach combines the iterative closest point (ICP) pose estimation using the low-resolution range maps delivered by a PMD pico-flexx time-of-flight camera with the 6-DOF pose estimates given by inertial tracking. In contrast to prior approaches, we do not rely on additional sensors, such as 2-D visual odometry. We fuse the results of the inertial tracking with the extrapolated ICP pose estimates using an extended Kalman filter (EKF). Subsequently, the output of the EKF is used as initial guess for the next ICP-based pose estimation. This approach yields an efficient and robust pose tracking for the spatially and temporally low-resolution range data given in mobile applications, and, at the same time, it results in a consistent geometric reconstruction, as the final pose minimizes the error with respect to the scene geometry.

Index Terms— Scene reconstruction, mobile, tracking, extended Kalman filter.

I. INTRODUCTION

Real-time 3D scene reconstruction from depth data is a well-established research area in which several approaches have been proposed [1]–[3]. However, because of high computational requirements and due to the absence of highly integrated range cameras, their implementation was limited for a long time to high-end platforms, including robots involving PCs or laptops and rather energy-intensive and bulky range cameras such as the Microsoft Kinect. Alternative approaches utilize the standard colour camera of a mobile device, i.e. a smartphone or tablet, in order to extract 3D information from RGB image streams [4]–[7]. In the last years, however, the availability of highly integrated Time-of-Flight (ToF) depth cameras, such as the Real3™ area sensor from Infineon [8] integrated in ToF cameras such as the picoflexx from pmd-technologies [9] or in mobile devices such as Lenovo's PHAB2 Pro, makes mobile 3D reconstruction possible.

The 3D scene reconstruction problem requires the estimation of the camera's pose (position and orientation) and the reconstruction of the scene in parallel. Thus, it is structurally very similar to simultaneous localization and mapping (SLAM), which can also involve ToF cameras [10]. 3D reconstruction and SLAM, however, have different foci: while 3D scene reconstruction aims for a high geometric quality of the reconstructed scene, in SLAM trajectory and pose estimation have the highest priority. In this paper, we mainly address 3D scene reconstruction; a survey of visual SLAM approaches can be found, e.g., in [11].

Commonly, in 3D scene reconstruction the camera pose estimation is formulated as a registration problem. Given a sequence of overlapping depth maps delivered by the range camera, the camera pose is estimated by finding the best alignment between two successive depth maps (frame-to-frame registration) or between the current frame and the reconstructed model of all preceding frames (frame-to-model alignment). There are various registration approaches, e.g. iterative closest point (ICP), RANSAC, PCA, or cross-correlation [12], [13]; see Salvi et al. [14] for an overview. In online scene reconstruction, the ICP algorithm is commonly applied.

Camera tracking based on, e.g., ICP is a task demanding high computing and data cost even in standard, desktop 3D scene reconstruction pipelines. Since the accuracy of 3D map registration heavily depends on the size of the overlapping areas and the magnitude of the relative transformation that has to be estimated, it requires both significant CPU and GPU capabilities and a high spatial and temporal resolution of the depth data. However, mobile devices with highly integrated ToF cameras comprise restricted computational resources and significantly lower temporal and spatial resolution (see Tab. I), making high-quality online 3D scene reconstruction hard to achieve.

On the other hand, online scene reconstruction can benefit from additional sensory information, e.g. from an inertial measurement unit (IMU). A common approach to embed these motion data is to provide them as initial guess for the geometric registration, i.e. to ICP. This initial guess can either be obtained by mere integration of the IMU data [15], or by sensor fusion, e.g. with an extended Kalman filter (EKF), that delivers more robust pose predictions. However, because the IMU lacks any complementary data source for position estimation, fusion algorithms require either a third sensory input from, e.g., a camera-based feature tracker [16], or they are restricted to the rotational component only [17]. Some approaches apply a final fusion of the ICP results with other sensory information. This approach, however, is only beneficial if the ICP pose is less reliable, leading to unwanted geometric reconstruction errors.
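To make the loosely coupled combination discussed above more concrete, the following minimal Python sketch shows a per-frame tracking loop in which a fused pose prediction serves as the ICP initial guess and the ICP result is fed back into the filter. It is an illustration only, not the implementation of this paper or of any cited system; all objects and method names (ekf, icp, model and their methods) are hypothetical.

```python
# Minimal sketch of a loosely coupled IMU + ICP tracking loop (hypothetical API).
def track_frame(ekf, icp, model, depth_frame, imu_samples):
    # Propagate the filter with all IMU samples received since the last depth frame.
    for sample in imu_samples:
        ekf.propagate(sample.gyro, sample.accel, sample.dt)

    # Use the fused prediction as the initial guess for the geometric registration.
    pose_guess = ekf.predicted_pose()
    pose = icp.register(depth_frame, model, initial_guess=pose_guess)

    # Feed the registration result back so that the filter can correct its prediction.
    ekf.correct(pose)

    # Fuse the depth frame into the accumulated model under the refined pose.
    model.integrate(depth_frame, pose)
    return pose
```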
TABLE I
Key measures of a typical desktop platform used in recent publications, e.g. [23], and of mobile devices (Tegra K1). Note that the performance and fill-rate figures are imperfect, as there are no standardized acquisition processes.
In this paper we propose a lightweight solution for real-time 3D reconstruction on mobile devices. In order to address the challenges arising from a low temporal and spatial depth resolution and the limited computational resources, we propose a novel multi-sensor tracking, solely using an IMU as additional sensory input. In particular, our approach comprises the following contributions:
• We adapt the point-based framework as proposed by Keller et al. [3] for mobile online scene reconstruction.
• For the initialization of the ICP-based registration, we propose a novel EKF-based fusion approach that robustly estimates rotational and positional pose information using the IMU sensor in combination with extrapolated ICP pose estimations as virtual measurements.
• We demonstrate that our approach minimizes the risk of an ICP failure, particularly in the case of fast camera motion.

II. RELATED WORK

The integration of inertial tracking and 2D imagery for 3D reconstruction on mobile devices has been investigated in several prior works. In the context of scene reconstruction based on motion stereo, inertial tracking has been used for scene scale estimation and for bridging the temporal gap between the two camera frames which are asynchronously acquired [4], or as initial guess of the current pose within the scope of a photo-consistency-based tracking [5]. In [7] an online estimation of the relative rotation between two camera frames, obtained by integrating the gyroscope data only and combined with the information from a 2D feature point tracker, is used for an offline structure-from-motion algorithm. Other approaches use visual inertial odometry (VIO) as a filter-based fusion of inertial and feature point tracking data, e.g. VIO from the Tango library [6].

In the context of scene reconstruction from depth data, several approaches use additional sensory information, such as inertial tracking and visual tracking, in order to improve the camera pose estimation.

Kähler et al. [17] estimate the rotational and translational components of the camera pose separately from different tracking sources. The IMU data in connection with a fusion algorithm are used for orientation estimation, whereby ICP and the colour-based tracker only optimize the position. This way the rotational drift is reduced.

Klingensmith et al. [18] incorporate VIO from a Tango mobile device, which fuses inertial data with visual odometry based on feature tracking using a fish-eye camera. The low depth frame rate of the range camera of ≈ 3−6 Hz doesn't allow for standard ICP-based pose estimation. Thus, the ICP is initialized with the pose from Tango VIO and refined incrementally. The approach is designed for large-scale scene reconstructions with low to moderate reconstruction resolution of 2−3 cm and yields a camera drift of ≈ 5 m in the case of long trajectories (≈ 175 m). In [16] the authors further improved their approach by applying a corrective transformation to the Tango-based pose estimation for ICP initialization in order to compensate the VIO drift. The correction term is updated with each depth frame using the difference between the initial guess and the final ICP pose.

Huai et al. [19] combine inertial data with ICP and SIFT odometry on a heavy-weight mobile platform consisting of a notebook and a Microsoft Kinect range camera. The integration of IMU readings provides a predicted camera orientation, which is used for initialization and validity check of the ICP. In case ICP fails, SIFT odometry is used. If both ICP and SIFT odometry fail, the incremental motion from the inertial tracking is used as the final pose estimation. An EKF is used to correct the IMU-based predictions, where ICP or SIFT odometry serve as measurements for position correction, whereas the states with a small acceleration are utilized for roll and pitch correction by means of the gravity direction from the accelerometer.

In the context of SLAM, Chow et al. [20] describe a 3D terrestrial LiDAR system that integrates a MEMS IMU and two Microsoft Kinect sensors to map indoor urban environments in a stop-and-go fashion. The pose estimation is achieved by an implicit iterative extended Kalman filter (IEKF) that predicts poses by integrating IMU data and gets measurements from visual tracking. The latter alternatively uses ICP-based point cloud matching initialized by sparse 3D feature point matching (during the "going" state), LiDAR data (during the "stopped" state), or 5-point monocular visual odometry (VO) in highly textured areas or regions with few depth features. The IMU-based pose predictions are also used for the initialization of the visual tracking.

To the best of our knowledge, current approaches solely relying on IMU data as additional sensory input are restricted to non-lightweight platforms and non-integrated range cameras such as the Microsoft Kinect. Niessner et al. [15] combine the
where V_{M,i−1} is the model vertex map in the camera coordinates of frame i−1 and û the estimated pixel coordinates of the model point corresponding to V_i(u). This error function measures the distance between the incoming point and the tangential plane of the model at the corresponding model point.

The ICP algorithm solves a highly non-linear optimization problem. Thus, its successful convergence heavily depends on a good initial guess for the camera pose. In the case of high temporal and spatial resolution range cameras such as the Microsoft Kinect, a simple initialization, such as the identity matrix [2], is fully sufficient. In our case, i.e. for a low temporal and spatial range camera resolution, this leads to misalignments, i.e. the optimization results in a local minimum, which yields significant geometric artifacts (e.g. ghost geometries). In order to cope with faster motions, the ICP is commonly applied in a hierarchical manner [2]. By setting up an image pyramid, coarser versions of the underlying matching problem are deduced. Solving this hierarchical optimization in a coarse-to-fine approach leads to a refinement of the camera pose which can overcome local minima to some extent.

Model Update: After the camera pose estimation, the currently incoming range data is fused with the so-far accumulated model data. In the case of the point-based geometry representation, as proposed by Keller et al. [3], the final correspondences are used and the following attributes are averaged into the model: position, normal, and point size. In order to cope with outliers and isolated points, the model points have a confidence attribute, which counts the number of merges, i.e. observations. Points are tagged as "unstable" or "stable" if their confidence counter is below or above a given threshold, respectively. Furthermore, incoming points observed from larger distances than the corresponding model point are withdrawn from accumulation, as they are less reliable than the accumulated observations.

B. Implementation Details

In this paper, we basically use the point-based online scene reconstruction framework proposed by [3], [23]. Due to its memory efficiency, this approach does not impose specific modifications with respect to memory or algorithmic layout when porting to a mobile platform. Still, there are some adaptations required in order to use it in the case of range cameras with low temporal and spatial resolution, such as the picoflexx [9]. Highly integrated ToF cameras are designed for near-range and wide-field-of-view types of applications. The wide field of view leads to severe camera distortions which need to be accounted for by camera calibration and in the pre-processing of the range maps. Furthermore, the wide field of view in combination with the low spatial resolution leads to points with larger sizes and higher noise even at moderate camera-to-object distances and, on average, to fewer model point observations. This effect needs to be taken care of by adapting the system thresholds for, e.g., the confidence counter and the ICP convergence.

IV. MULTI-SENSOR CAMERA TRACKING WITH EKF

ICP is known to produce accurate estimations on the condition that the transformations between successive camera poses are relatively small. However, larger transformations can lead to false results due to convergence in a local minimum, or even to a failure due to false correspondences. On a mobile system the probability of such critically large frame-to-frame transformations increases because of a lower range camera frame rate and longer processing times due to limited computational resources. The latter may lead to the situation that not all camera frames can be processed in real time, which results in frame dropping and, consequently, reduces the effective frame rate even more. Our camera tracking approach improves robustness by incorporating inertial sensor data in order to predict the camera pose that is used as ICP initialization. As a consequence, our approach speeds up convergence as it allows reducing the number of image pyramid levels.

Since integrating noisy IMU data quickly leads to error accumulation, the results of inertial tracking are commonly fused with data from another tracking source, e.g. from a feature point-based visual odometry. An often used sensor fusion algorithm is the EKF. In our approach, however, we deliberately do not incorporate any additional tracking device in order to keep the hardware effort as low as possible. Instead, we assume moderate change rates in human motion and take advantage of the available ICP camera pose estimations. In particular, we predict future camera poses by extrapolating ICP pose estimates from previous frames and take them as "virtual measurements" that are fused with the IMU integration results using an EKF, with the objective to compensate the errors due to the high-frequency IMU noise.

Consequently, our EKF approach aims at estimating the ICP results, which are considered as true states. Since the true states corresponding to the camera frame i are known, it is preferable to reset the filter to these values and estimate the next frame-to-frame transformation T_{(i+1)→i} only, thus keeping the error accumulation as small as possible. However, due to a processing delay in the reconstruction pipeline, the last ICP result T_i^ICP is not available at t_i, i.e., at the beginning of the next inter-frame interval i → (i+1). We address this issue by splitting the rotational part of the camera pose R_{i+1} into the prior frame orientation R_i (base orientation) and the current frame-to-frame rotation R_{(i+1)→i}, i.e., R_{i+1} = R_i · R_{(i+1)→i}. The EKF starts for each interval i → (i+1) with zero frame-to-frame rotation and translation and with an estimate of R_i, without waiting for the ICP result. These states are propagated by integration of the incoming IMU data within the interval. At t_{i+1}, i.e., at the end of the interval, the true previous orientation R_i^ICP is always available and can be used in the measurement vector that is passed to the EKF for the correction of the predicted states. Due to the split representation of the orientation, it is possible to weight each of its components separately by setting appropriate values in the noise covariance matrices and, thus, to control their influence on the corrector. A generic scheme of the tracking workflow is shown in Fig. 2.

The (partial) filter reset after each inter-frame interval and the weighting of the orientation components according to their
Fig. 3. Examples of estimated pdf of virtual noise. (a) δa.x. (b) δw.x.
In order to calculate z_i, first the velocity ṽ_i, the acceleration ã_i and the angular velocity ω̃_i (all in global coordinates) are estimated with a backward Euler method:

  ṽ_i = (p_i^ICP − p_{i−1}^ICP) / t_{i−1}   if i > 0,   and ṽ_i = 0 otherwise,   (16)

  ã_i = (ṽ_i − ṽ_{i−1}) / t_{i−1}   if i > 1,   and ã_i = 0 otherwise,   (17)

  q̃_{ω,i} = (2 / t_{i−1}) (q_i^ICP − q_{i−1}^ICP) ⊗ q_{i−1}^{∗ICP}   if i > 0,   and q̃_{ω,i} = (1, 0^T) otherwise,   (18)

where p^ICP, q^ICP are the translational and rotational parts of the respective ICP camera pose estimates, t_i is the time between the ICP frames i and i+1, and q̃_ω = (0, ω̃) contains the angular velocity ω̃ as its imaginary vector component.

After that, the measurements are calculated with an explicit Euler method:

  q^e_{i+1} = q_i^{∗ICP} ⊗ (q_i^ICP + ½ t_i q̃_{ω,i} ⊗ q_i^ICP),   (19)
  v^e_{i+1} = ṽ_i + t_i ã_i,   (20)
  p^e_{i+1} = p_i^ICP + t_i ṽ_i.   (21)

The measurement function h(·) is linear and only copies the a-priori estimates:

  h(x̂^−_{k+1}) = x̂^−_{k+1}.   (22)

C. Noise Modeling

Besides the bias, whose treatment is explained in IV-A, the IMU measurements are perturbed by white noise w, which is commonly modeled as Gaussian [26]. w is considered in the process noise covariance matrix Q.

The noise in the virtual measurements arises from uncertainties in the estimation of the motion parameters (Eq. (16)-(18)) used by the extrapolation. More precisely, the linear and angular velocities, estimated over an interval (i−1) → i, and the acceleration, estimated over (i−2) → i, are assumed to be constant over the next interval i → (i+1), which generally is an approximation and not the true state. Having the camera pose T_{i+1}^ICP, we can retrospectively calculate the true values v_i, a_i and ω_i by substituting the left-hand sides of Eq. (19)-(21) with the ICP results and solving them for the respective motion parameters. Then, the current deviations from the true values, normalized by the time intervals, can be obtained as

  δv_i = (ṽ_i − v_i) / t_i,   (23)
  δa_i = (ã_i − a_i) / t_i,   (24)
  δω_i = (ω̃_i − ω_i) / t_i.   (25)

We applied the above calculations to the outcomes of our three scene reconstruction experiments in order to estimate the distributions of δv, δa and δω (see Fig. 3(a)-3(b)). The results demonstrate that they follow a Gaussian distribution with zero mean, which justifies modeling them as white Gaussian noise.

Even though both noise covariance matrices are conceptually related to real-world physical noise measures, we treat them as filter tuning parameters by setting them as constant, diagonal matrices. The matrix values are determined experimentally by manual optimization with respect to filter response and noise rejection, as described hereafter. This kind of simplification is repeatedly applied in the context of inertial navigation, e.g. [27], and it yields satisfactory precision at minimal computational costs, i.e. minimal time delays for the overall system (see Sec. V-B).

The values of the noise covariance matrix Q are determined taking the following considerations into account. The integration of the angular velocity (after bias subtraction) delivers rather precise orientation estimations. The velocity estimation and, even worse, the position estimates are less precise due to the high-frequency noise in the acceleration values and the double integration. After manual optimization, we set Q's diagonal matrix elements that correspond to q, v, p and q_b to 10^−5, 10^−2, 10^−1 and 10^−5, respectively. The measurement noise covariance matrix R is set in an analogous manner. Initially, we conducted some simulation experiments in order to get more insight into the precision of the ICP extrapolation values. We found a higher precision for the positional and a lower precision for the rotational values. The base orientation from ICP, interpreted as true state, gets the highest confidence, i.e. the smallest noise value.
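To make the extrapolation used for the virtual measurements concrete, the following sketch evaluates Eqs. (16)-(21) for one frame, given the two most recent ICP poses. It is a minimal sketch, not the paper's implementation; the quaternion helpers are standard Hamilton-product utilities introduced for this example (quaternions stored as [w, x, y, z]), and the re-normalization of the extrapolated quaternion is an added safeguard not stated in the equations.

```python
import numpy as np

def quat_mul(a, b):
    # Hamilton product for quaternions stored as [w, x, y, z].
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_conj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def virtual_measurement(p_prev, q_prev, p_cur, q_cur, dt_prev, dt_cur, v_prev):
    """Extrapolate the ICP poses of frames i-1 and i to frame i+1 (Eqs. (16)-(21))."""
    # Backward-Euler estimates of velocity, acceleration and angular velocity.
    v_tilde = (p_cur - p_prev) / dt_prev                                   # Eq. (16)
    a_tilde = (v_tilde - v_prev) / dt_prev                                 # Eq. (17)
    q_omega = 2.0 / dt_prev * quat_mul(q_cur - q_prev, quat_conj(q_prev))  # Eq. (18)

    # Explicit-Euler extrapolation used as virtual measurement for the EKF.
    q_next = quat_mul(quat_conj(q_cur),
                      q_cur + 0.5 * dt_cur * quat_mul(q_omega, q_cur))     # Eq. (19)
    q_next /= np.linalg.norm(q_next)   # keep a unit quaternion
    v_next = v_tilde + dt_cur * a_tilde                                    # Eq. (20)
    p_next = p_cur + dt_cur * v_tilde                                      # Eq. (21)
    return q_next, v_next, p_next, v_tilde
```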
Fig. 4. The test scenes used for evaluation. (a) Hen scene and Tango tablet with mounted picoflexx camboard in start position. (b) Hen Duplo scene.
(c) Office scene.
Fig. 5. Reconstruction of the geometric resolution target. Space widths left to right: 22 mm, 14.5 mm, 9 mm, 6 mm, 4 mm, 2.5 mm and 1.5 mm. (a) Reconstructed target.

After manual optimization, the values of R that correspond to q^e, v^e, p^e and q_b^ICP are set to 10^−3, 10^−6, 10^−5 and 10^−13, respectively.

From the perspective of accuracy and precision, more sophisticated approaches to calculate Q and R, incorporating, e.g., measured IMU noise values and dynamic adaptation, are promising. However, besides the additional workload and the resulting loss in temporal performance, the asymmetrical design of our EKF, i.e. the unequal number of propagation and correction steps, is a major challenge for the conceptual design and implementation of these kinds of filters. For instance, a commonly used adaptive approach applies noise covariance matching by means of innovation or residual covariances during each correction step [28], [29]. These covariances are estimated within a sliding window of the n latest measurement updates, and it is assumed that the error covariance matrix P (Eq. 11) is constant within this temporal window [30]. This assumption, however, is unjustifiable in our case, as ≈ 10 system propagations that update P (see Eq. 12) occur between two consecutive measurement updates. In future work, we will investigate more sophisticated, adaptive filter approaches that lead to improved filter response and noise rejection at reasonable computational costs.

D. Correction and Filter Reset

Once a new camera frame i+1 initiates a new cycle of the reconstruction pipeline (see Fig. 1), the estimation of the pose transformation for the interval i → (i+1) is terminated and the result T̂^EKF_{(i+1)→i} is available for the ICP pipeline module. At t_{i+1}, i.e. at the end of the interval, the virtual measurements are computed via ICP extrapolation (Eq. (19)-(21)) and the EKF correction is executed. Due to the linearity of h(·), the Kalman gain equation can be written in a simplified form:

  K_{k+1} = P^−_{k+1} (P^−_{k+1} + R)^{−1}.   (26)

Knowing the Kalman gain, the a-posteriori estimates of the system states and of the error covariance matrix are computed:

  x̂^+_{m_i} = x̂^−_{m_i} + K_{m_i} (z_{m_i} − x̂^−_{m_i}),   (27)
  P^+_{m_i} = (I − K_{m_i}) P^−_{m_i},   (28)

where m_i is the last IMU sample index in the current inter-frame interval i → (i+1). Then the current camera pose estimate is determined as

  p̂^Init_{i+1} = p_i^ICP + p̂^+_{m_i},   (29)
  q̂^Init_{i+1} = q_i^ICP ⊗ q̂^+_{m_i}.   (30)

After the submission of the frame-to-frame pose component to the ICP module, the EKF is reset for the next inter-frame interval (i+1) → (i+2). Let

  x̂_{0,i+1} = g(x̂^+_{m_i}, q_i^ICP)   (31)

be the reset function that re-initializes the state vector for the interval (i+1) → (i+2) and operates as:

  q̂_{0,i+1} = (1, 0^T),   (32)
  v̂_{0,i+1} = v̂^+_{m_i},   (33)
  p̂_{0,i+1} = 0,   (34)
  q̂_{b,0,i+1} = q_i^ICP ⊗ q̂^+_{m_i}.   (35)

Consequently, the errors in the frame-to-frame translation and rotation are not propagated to the next interval, and the base orientation includes only an error from the last frame-to-frame rotation estimate. Finally, the error covariance matrix P is re-initialized:

  P_0 = J_g(x̂^+_{m_i}) P^+_{m_i} J_g(x̂^+_{m_i})^T,   (36)

where

  J_g(x̂^+_{m_i}) = ∂g/∂x̂^+_{m_i} |_{q_i^ICP}   (37)

is the Jacobian of g(x̂^+_{m_i}). Since the part which relates to the frame-to-frame translation and rotation is cleared to zero, we reinitialize the corresponding diagonal values as in Eq. (14).
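The correction step and the composition of the next initial guess can be written compactly; the following sketch mirrors Eqs. (26)-(30) for a generic state vector. It is an illustrative sketch under the assumption that the measurement z and the state share the same layout (so that h is the identity, as stated above); quaternion states would additionally require re-normalization, and the reset function g of Eqs. (31)-(36) is not shown.

```python
import numpy as np

def quat_mul(a, b):
    # Hamilton product, quaternions stored as [w, x, y, z].
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def ekf_correct(x_pred, P_pred, z, R):
    """EKF correction with an identity measurement function (Eqs. (26)-(28))."""
    K = P_pred @ np.linalg.inv(P_pred + R)          # Eq. (26), simplified Kalman gain
    x_post = x_pred + K @ (z - x_pred)              # Eq. (27), a-posteriori state
    P_post = (np.eye(len(x_pred)) - K) @ P_pred     # Eq. (28), a-posteriori covariance
    return x_post, P_post

def initial_guess(p_icp_i, q_icp_i, p_f2f, q_f2f):
    """Compose the last ICP pose with the corrected frame-to-frame components
    to obtain the ICP initial guess for frame i+1 (Eqs. (29)-(30))."""
    p_init = p_icp_i + p_f2f
    q_init = quat_mul(q_icp_i, q_f2f)
    return p_init, q_init
```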
TABLE II
Hen scenario: the rotational and positional drift errors, as well as the total number of iterations for the ICP and the ICP + IMU method for different frame dropping (All, 1/2, 1/4) and different pyramid levels (l = 3, l = 2, l = 1). ∅ indicates ICP failure.

TABLE III
Hen_Duplo scenario: the rotational and positional drift errors, as well as the total number of iterations for the ICP and the ICP + IMU method for different frame dropping (All, 1/2) and different pyramid levels (l = 3, l = 2, l = 1). ∅ indicates ICP failure.

TABLE IV
Office scenario: the rotational and positional drift errors, as well as the total number of iterations for the ICP and the ICP + IMU method for different frame dropping (All, 1/2) and different pyramid levels (l = 3, l = 2, l = 1). ∅ indicates ICP failure.

TABLE V
Hen_Slow scenario: the rotational and positional drift errors, as well as the total number of iterations for the ICP and the ICP + IMU method for different frame dropping (All, 1/2, 1/4) and different pyramid levels (l = 3, l = 2, l = 1). ∅ indicates ICP failure.
Fig. 6. The Office scene reconstruction results. (a) Reconstruction with the ICP + IMU method, 1/2 frames, l = 1. (b) Ghost artifacts in the reconstruction with the ICP method, All frames, l = 3.
The range data are provided by an external CamBoard picoflexx fixed on the tablet and connected via USB. The picoflexx provides ≈ 15 FPS at a resolution of 224 × 171 px. For a comparison of the principal characteristics of the equipment used in our handheld system and in a recent desktop solution [23], they are summarized in Tab. I.

In order to evaluate the achievable accuracy with the given system, we designed a geometric resolution target consisting of bars and spaces of different widths (see Fig. 5(a)). The width decreases from left to right: 22 mm, 14.5 mm, 9 mm, 6 mm, 4 mm, 2.5 mm and 1.5 mm. We capture the target from a distance of 0.5 m, which corresponds to a theoretical point diameter of ≈ 2.5 mm (see Tab. I). Fig. 5(a) shows the result of the target reconstruction. The two largest spaces are clearly recognisable, the next two are partially closed, and the space of 4 mm is only distinguishable as a groove without holes. The spaces smaller than 4 mm cannot be recognized. This result is expected, as the imaging system's optics, e.g. its point-spread function, limits the system resolution.

The depth camera is calibrated against the integrated RGB camera of the tablet using standard calibration methods [31]. In order to calculate the complete calibration matrix, we use the transformation between the RGB camera and the IMU from the Tango library.

We demonstrate an improvement of tracking robustness by our method applying extreme scenarios with fast handheld camera motions and abrupt directional changes. In the experiments we use three different scenes (see Fig. 4(a)-4(c)); for each of them we acquired a data set consisting of range and IMU data. The first scene, with relatively sparse geometric details, is composed of a gypsum hen figure and two cubes with gypsum eggs thereon (Hen). For the second data set, Hen was extended with some Lego Duplo figures (Hen_Duplo). The third scene shows an office desk (Office). The data was acquired by moving the camera around the respective scene with loop closure. In addition, we recorded the Hen scene with slower, smoother camera movements (Hen_Slow) in order to evaluate the system under moderate motion conditions.

We compare the result of our tracking method (ICP + IMU) with the ICP initialized with the identity matrix (ICP) in consideration of the following aspects.
Fig. 7. Distance quality of the EKF estimate Init_EKF compared with the standard initialization approach Init_Id for the Office scenario with frame dropping All, 1/2.

Fig. 8. Angular quality of the EKF estimate Init_EKF compared with the standard initialization approach Init_Id for the Office scenario with frame dropping All, 1/2.
Levels of Image Pyramid: Executing the ICP on several levels of the input image stabilizes the camera pose estimation; see Sec. III-A. However, this hierarchical approach requires additional computations and memory. We demonstrate that, due to the enhanced initial guess by ICP + IMU, we obtain good correspondence pairs already at the full image resolution, thus decreasing the overall computation and memory requirements. In our experiments we use l = 3, l = 2 and l = 1 pyramid levels.

Frame Dropping: The throughput of the reconstruction pipeline on the given mobile hardware is ≈ 8 FPS, while the picoflexx delivers ≈ 15 depth frames per second. Thus, in real time the pipeline is unable to process all data and has to drop some frames. Such an online frame dropping depends on different internal factors of the operating system and is therefore unpredictable. In order to achieve reproducible results, we used a controlled frame dropping approach in our experiments. This approach processes a predefined set of depth images, independently of the processing time of the mobile device, leading to the test sequences All, 1/2 and 1/4 that contain all, each second and each fourth depth frame, respectively. The frame dropping enlarges the pose transformation between consecutive processed frames and leads to similar inputs as capturing the same motion at a higher speed.
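As a side note, the controlled frame dropping described above amounts to a fixed subsampling of the recorded depth sequence; a minimal sketch (not the paper's code):

```python
def controlled_frame_dropping(depth_frames, keep_every):
    """Return the test sequence: keep_every = 1 -> 'All', 2 -> '1/2', 4 -> '1/4'."""
    return depth_frames[::keep_every]

# Example: sequence_half = controlled_frame_dropping(recorded_frames, 2)
```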
In order to evaluate the overall tracking precision and the global drift errors, we set up the camera paths as loop closures, i.e. with the same start and end pose. Thus, we can calculate the positional and rotational drift for each test sequence and pose estimation method as the absolute differences between the first and final estimated position and orientation, respectively (see Tab. II-V). Large drift errors indicate unreliable pose estimations: in cases where they exceed the specified thresholds (0.15 m for the positional and 10° for the rotational error), we place the respective results in parentheses, even though ICP itself did not fail, i.e. no matrix singularity occurred.

A reduction of the image pyramid does not automatically lead to a smaller total number of ICP iterations, since a bad initial guess can slow down the convergence on lower pyramid levels. In order to demonstrate the impact of the pyramid reduction on the computational effort, we consider for each experiment the mean and the standard deviation of the total iteration number.

Finally, we evaluate the enhancement of the initial guess by our method, considering the difference in position and orientation between the EKF estimate, i.e., the ICP's initial pose, and the final pose after ICP (Init_EKF) and comparing it with the respective difference between the last and the current ICP pose (Init_Id), which corresponds to the common ICP initialization with the identity matrix.

Besides the demonstration of an improvement over the standard ICP initialization, we also compare our method with the Tango VIO-based initialization approach proposed in [16]. For this purpose, we saved the respective pose from Tango tracking for each incoming depth image during the Office scene capturing. After the scene is reconstructed, we calculate the pose predictions offline, using the stored Tango data and the ICP results. Again, we obtain differences between Tango-based predictions and ICP poses (Init_Tango).

B. Evaluation

The experimental results in Tab. II-IV show a considerable improvement of tracking robustness by ICP + IMU. Considering fast motion scenarios, ICP delivers reliable results only for one configuration, i.e. the Hen_Duplo scenario (Tab. III) with all pyramid levels and all frames (which is de facto not achievable in online mode), whereas ICP + IMU yields valid reconstructions in almost all scenarios and configurations. Fig. 6(b) shows visual artifacts due to a large drift error in the Office scene reconstructed by ICP using All frames and l = 3. The comparison in Fig. 6(a) shows the more precise reconstruction results achieved with ICP + IMU using fewer frames (1/2) and pyramid levels (l = 1). Note that the artifacts on the computer screen are attributable to the general weakness of ToF cameras in capturing strongly light absorbing, i.e. dark, surfaces.

For the frame dropping rate 1/2, which is close to the online frame dropping due to the hardware limitations, ICP + IMU works successfully in all experiments, which demonstrates the real-time capability of our method. Furthermore, ICP + IMU preserves tracking stability with fewer pyramid levels in most cases. In particular, in all tests with the "online-like" frame dropping 1/2, the pose estimation on the basis of the original image resolution (l = 1) was possible.

On the other hand, in the experiment with a slow camera motion (Tab. V), suitable for scene reconstruction with ICP, the application of ICP + IMU doesn't deteriorate the tracking results and exhibits higher robustness under the highest frame dropping (1/4) with a reduced image pyramid (l = 2, l = 1). As can be seen in Tab. III and V, in those experiments where both methods have delivered reliable results, the mean iteration number of ICP + IMU is only marginally below the one of ICP. In general, however, ICP + IMU achieves a robust result with considerably fewer overall iterations when fewer pyramid levels are used. As expected, the drift errors are comparable in cases where ICP and ICP + IMU deliver reliable results.

Concerning the difference between the initial guess and the final pose, Fig. 7 and 8 show the absolute errors for a sequence of ≈ 130 frames of the Office scenario with l = 1 for frame dropping All and 1/2. In the vast majority of the frames there is a significant improvement by the EKF prediction. The overall relative error Init_EKF/Init_Id for the range image sequence of All frames is 0.5392 ± 0.6965 for the mean relative positional and 0.4556 ± 0.3697 for the mean relative rotational error. Dropping half of the frames (1/2), we get a mean relative position error of 0.4348 ± 0.4546 and a mean relative rotational error of 0.3283 ± 0.2876. Thus, for faster motion we get relatively better initializations using our method.

As a comparison, the Tango-based initialization according to [16] produces, with the frame sequence 1/2, a mean relative error Init_Tango/Init_Id of 0.2926 ± 0.3147 in position and 0.4349 ± 0.4659 in orientation. The results demonstrate that our method is only slightly less precise in position prediction and, at the same time, has a slightly higher precision in orientation prediction. In summary, we can obtain an initial guess precision comparable with the above VIO method without requiring a fisheye camera with a large opening angle and high temporal and spatial resolution.

The processing of an individual input depth frame according to Fig. 1 requires ≈ 23 ms for the preprocessing stage, 20−25 ms and ≈ 11 ms for the pose estimation stage in the l = 3 and the l = 1 case, respectively, 41−51 ms for the depth map fusion, and 35−57 ms for the surface reconstruction.

C. Limitations

1) Gap in Sensor Sampling Rate: A higher sampling frequency of the IMU allows bridging longer intervals between depth images. However, the integration of raw IMU data leads to a considerable sensor drift due to error accumulation (see the discussion in Sec. IV). Although our method partially corrects the integration errors by the EKF at the end of each inter-frame interval, the error increases for longer temporal gaps between successive processed depth images, causing the a-posteriori estimates of the EKF to be less precise. Furthermore, larger inter-frame intervals lead to a decreased reliability of the ICP pose extrapolations, particularly in the case of abrupt changes of the motion direction.
TABLE VI
Overview of the fundamental characteristics of mobile approaches for scene reconstruction from range data: required sensors, application of sensor fusion, internal model representation and system performance (value is n/a if no data are available).
Thus, the difference between the initial guess and the final pose in our method increases for higher frame dropping rates, albeit, as a general tendency, remaining below this parameter in the standard approach. For example, only one "fast" data set (Hen) can be processed with the dropping rate 1/4 (see Tab. II). The only way to counteract this problem is to reduce the inter-frame gap by faster depth frame processing, which, in turn, requires novel reconstruction approaches.

2) Noise Estimation: A further aspect that can limit the prediction precision is a low accuracy of the noise estimation. As described in Sec. IV, we tuned the process noise covariance matrix experimentally; however, a measurement-based noise estimation, for instance using the Allan variance [32], might provide more accurate results. Concerning the noise of the ICP extrapolations, an online calculation depending on the length of the inter-frame interval may lead to a more appropriate noise estimation model.

3) ICP Failure Handling: A good initial guess of the current camera pose can prevent ICP errors, such as convergence in local minima or false correspondence association, related to large frame-to-frame transformations. However, a lack of geometric features, e.g. when sliding along planar geometries, also results in ICP stability problems. In these cases the ICP and EKF estimations diverge. If this constellation appears in several successive frames, a tracking failure may occur. The experiment Hen, All, 1/2, l = 1 (Tab. II) shows that in this particular case even discarding frames prevents the tracking failure. A possible solution for the above problem is to discard the ICP results for such frames and to continue the inertial-based estimation until new reliable ICP estimates are available. This requires sophisticated ICP error detection [15], [21]. This improvement is orthogonal to our proposed method and will be integrated into our system as part of the future work.

VI. CONCLUSIONS

In this paper we presented a novel, lightweight solution for real-time 3D reconstruction on mobile devices that uses IMU data as the only additional sensory input. Our pose estimation incorporates a novel EKF-based fusion of inertial tracking data with extrapolated ICP camera poses in order to initialize the ICP pose estimation. We demonstrate a considerable enhancement of the tracking robustness in comparison with the common initialization approach. On the one hand, our method shows a higher stability in the case of fast camera motion. On the other hand, our approach reliably deals with the low temporal resolution of highly integrated depth cameras such as the picoflexx. In addition, the robust ICP initialization allows reducing the number of ICP image pyramid levels and, consequently, achieving a higher depth frame throughput. All in all, our approach minimizes the negative impact of the hardware limitations existing on a mobile system and allows 3D point-based scene reconstruction for fast camera motion. Tab. VI summarizes some distinctive characteristics of the approaches for scene reconstruction from range data on mobile devices, illustrating the most important conceptual differences between our method and existing solutions.¹

¹The parameter "Model extent" in this table refers to the extent limitations due to the system design. Note that the real model size is also limited by the available memory.

In future work we will improve the robustness of our tracking method in scenarios with sparse geometric features by integrating a corresponding ICP failure detection logic. Thus, also ICP failures due to scenes with few geometric features can be handled more robustly. Furthermore, we aim to enhance the accuracy of the EKF by means of an adaptive noise estimation.

ACKNOWLEDGMENTS

We would like to thank Dr. Holger Nies for his time in discussing details of the Kalman filtering. This work was supported in part by the German Research Foundation in the context of the Collaborative Research Center (SFB) 1187 Media of Cooperation, and in part by the sub-project A06 Visual integrated medical cooperation.

REFERENCES

[1] S. Rusinkiewicz, O. Hall-Holt, and M. Levoy, "Real-time 3D model acquisition," ACM Trans. Graph., vol. 21, no. 3, pp. 438–446, 2002.
[2] R. A. Newcombe et al., "KinectFusion: Real-time dense surface mapping and tracking," in Proc. IEEE Int. Symp. Mixed Augmented Reality, Oct. 2011, pp. 127–136.
[3] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb, "Real-time 3D reconstruction in dynamic scenes using point-based fusion," in Proc. Int. Conf. 3D Vis. (3DV), 2013, pp. 1–8.
[4] P. Tanskanen, K. Kolev, L. Meier, F. Camposeco, O. Saurer, and M. Pollefeys, "Live metric 3D reconstruction on mobile phones," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2013, pp. 65–72.
[5] P. Ondrúška, P. Kohli, and S. Izadi, "MobileFusion: Real-time volumetric surface reconstruction and dense tracking on mobile phones," IEEE Trans. Vis. Comput. Graphics, vol. 21, no. 11, pp. 1251–1258, Nov. 2015.
[6] T. Schöps, T. Sattler, C. Häne, and M. Pollefeys, "3D modeling on the go: Interactive 3D reconstruction of large-scale scenes on mobile devices," in Proc. Int. Conf. 3D Vis., Oct. 2015, pp. 291–299.
[7] O. Muratov, Y. Slynko, V. Chernov, M. Lyubimtseva, A. Shamsuarov, and V. Bucha, "3DCapture: 3D reconstruction for a smartphone," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPR), Jun./Jul. 2016, pp. 291–299.
[8] Infineon. (2015). Real3™ Image Sensor Family—3D Depth Sensing Based on Time-of-Flight. [Online]. Available: http://www.infineon.com/dgdl/Infineon-REAL3%20Image%20Sensor%20Family-PB-v01_00-EN.PDF?fileId=5546d462518ffd850151a0afc2302a58
[9] PMD Technologies GmbH. (2015). Reference Design Brief CamBoard Pico Flexx. [Online]. Available: http://pmdtec.com/picoflexx/downloads/PMD_RD_Brief_CB_pico_flexx_V0201.pdf
[10] S. May, D. Droeschel, D. Holz, C. Wiesen, and S. Fuchs, "3D pose estimation and mapping with time-of-flight cameras," in Proc. IROS Workshop 3D Mapping, 2008, pp. 1–6.
[11] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, "Visual simultaneous localization and mapping: A survey," Artif. Intell. Rev., vol. 43, no. 1, pp. 55–81, 2015.
[12] G. Burel and H. Henoco, "Determination of the orientation of 3D objects using spherical harmonics," Graph. Models Image Process., vol. 57, no. 5, pp. 400–408, 1995.
[13] R. L. Larkins, M. J. Cree, and A. A. Dorrington, "Analysis of binning of normals for spherical harmonic cross-correlation," Proc. SPIE, vol. 8290, p. 82900L, Jan. 2012.
[14] J. Salvi, C. Matabosch, D. Fofi, and J. Forest, "A review of recent range image registration methods with accuracy evaluation," Image Vis. Comput., vol. 25, no. 5, pp. 578–596, 2007.
[15] M. Nießner, A. Dai, and M. Fisher, "Combining inertial navigation and ICP for real-time 3D surface reconstruction," in Proc. EUROGRAPHICS, 2014, pp. 13–16.
[16] I. Dryanovski, M. Klingensmith, S. S. Srinivasa, and J. Xiao, "Large-scale, real-time 3D scene reconstruction on a mobile device," Auto. Robots, vol. 41, no. 6, pp. 1423–1445, 2017.
[17] O. Kähler, V. A. Prisacariu, C. Y. Ren, X. Sun, P. Torr, and D. Murray, "Very high frame rate volumetric integration of depth images on mobile devices," IEEE Trans. Vis. Comput. Graphics, vol. 21, no. 11, pp. 1241–1250, Nov. 2015.
[18] M. Klingensmith, I. Dryanovski, S. S. Srinivasa, and J. Xiao, "CHISEL: Real time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields," Robot., Sci. Syst., vol. 4, Jul. 2015.
[19] J. Huai, Y. Zhang, and A. Yilmaz, "Real-time large scale 3D reconstruction by fusing Kinect and IMU data," ISPRS Ann. Photogramm., Remote Sens. Spatial Inf. Sci., vol. II-3/W5, pp. 491–496, Aug. 2015.
[20] J. C. K. Chow, D. D. Lichti, J. D. Hol, G. Bellusci, and H. Luinge, "IMU and multiple RGB-D camera fusion for assisting indoor stop-and-go 3D terrestrial laser scanning," Robotics, vol. 3, no. 3, pp. 247–280, 2014.
[21] T. Hervier, S. Bonnabel, and F. Goulette, "Accurate 3D maps from depth images and motion sensors via nonlinear Kalman filtering," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 5291–5297.
[22] M. Camurri, S. Bazeille, D. G. Caldwell, and C. Semini, "Real-time depth and inertial fusion for local SLAM on dynamic legged robots," in Proc. IEEE Int. Conf. Multisensor Fusion Integr. Intell. Syst. (MFI), Sep. 2015, pp. 259–264.
[23] D. Lefloch, M. Kluge, H. Sarbolandi, T. Weyrich, and A. Kolb, "Comprehensive use of curvature for robust and accurate online surface reconstruction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2349–2365, Dec. 2017.
[24] A. Kolb, J. Zhu, and R. Yang, "Sensor fusion," in Digital Representations of the Real World. Boca Raton, FL, USA: CRC Press, 2015, ch. 9, pp. 133–150.
[25] P. J. Besl and N. D. McKay, "A method for registration of 3-D shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, pp. 239–256, Feb. 1992.
[26] D. Titterton and J. Weston, Strapdown Inertial Navigation Technology, 2nd ed. Stevenage, U.K.: IET, 2004.
[27] A. Cavallo et al., "Experimental comparison of sensor fusion algorithms for attitude estimation," IFAC Proc. Vol., vol. 47, no. 3, pp. 7585–7591, 2014.
[28] R. K. Mehra, "On the identification of variances and adaptive Kalman filtering," IEEE Trans. Autom. Control, vol. 15, no. 2, pp. 175–184, Apr. 1970.
[29] J. Wang, "Stochastic modeling for real-time kinematic GPS/GLONASS positioning," Navigation, vol. 46, no. 4, pp. 297–305, 1999.
[30] F. Aghili and C.-Y. Su, "Robust relative navigation by integration of ICP and adaptive Kalman filter using laser scanner and IMU," IEEE/ASME Trans. Mechatronics, vol. 21, no. 4, pp. 2015–2026, Aug. 2016.
[31] Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 11, pp. 1330–1334, Nov. 2000.
[32] N. El-Sheimy, H. Hou, and X. Niu, "Analysis and modeling of inertial sensors using Allan variance," IEEE Trans. Instrum. Meas., vol. 57, no. 1, pp. 140–149, Jan. 2008.

Dmitri Presnov received the M.Sc. degree in computer science from the University of Siegen, Germany, in 2017. He is currently a Researcher with the Computer Graphics and Multimedia Systems Group, University of Siegen. His research interests are 3-D scene reconstruction in mobile environments and medical visualization.

Martin Lambers received the Diploma degree in mathematics from the University of Münster, Germany, in 2006, and the Dr.-Ing. degree in computer science from the University of Siegen, Germany, in 2011. He is currently a Senior Researcher with the Computer Graphics and Multimedia Systems Group, University of Siegen. His research interests include simulation, processing, analysis, and visualization of multimodal sensor data.

Andreas Kolb received the Ph.D. degree from the University of Erlangen, Germany, in 1995. He is currently the Head of the Computer Graphics and Multimedia Systems Group, University of Siegen, Germany. He is also the Spokesman of the Research Training Group Imaging New Modalities, funded by the German Research Foundation. His research interests include computer graphics and computer vision, including particle-based simulation and visualization, light-fields, real-time simulation, processing, and visualization of sensor data.