Smart Camera Network Localization Using a 3D Target

John Kassebaum, Nirupama Bulusu, Wu-Chi Feng


Portland State University
{kassebaj, nbulusu, wuchi}@cs.pdx.edu
ABSTRACT
We propose a new method to localize in three dimensions the camera-equipped nodes in a smart camera network. Our method has both lower costs and fewer deployment constraints than a commonly used computer vision-based approach, which is to opportunistically determine feature points in the overlapping views of pairs of cameras, compute the essential matrix for all such pairs, then perform a bundle adjustment to both refine all camera positions and orientations and determine a common scale. Our method utilizes a feature-point-filled 3D localization target with an efficient detection algorithm to determine the projection matrix for a camera viewing the target. Because the projection matrix gives the position and orientation of the camera in the external coordinate frame of the localization target, two or more nodes simultaneously localizing themselves to the target are automatically localized in the same coordinate frame. This technique can be used to localize a smart camera network with connected views because, as the target moves through the network, each node will localize itself to at least two target positions that are related by an easily determined rotation and translation, which can be used to globally align all node positions and orientations to any single network-viewable target position. We present results from a real indoor network and a suitably designed localization target, and show that our method can accurately localize the network when the target's feature points fill less than 5% of the frame. Because the target can be relatively small in frame, pairwise camera overlap can also be small.

Categories and Subject Descriptors
C.2.3 [Network Operations]: Computer-Communication Networks – Network management, network monitoring, public networks.

General Terms
Algorithms, Measurement, Design.

Keywords: Localization, smart camera networks.

1. INTRODUCTION
Distributed camera sensor networks can be used for many applications such as unobtrusive monitoring and tracking of wildlife and eco-habitats, 3D surveillance of people and vehicles in urban spaces, next generation network games, and virtual reality. To establish spatial context in network deployments for such applications, one could manually measure camera positions and orientations, yet this is neither efficient nor scalable and is subject to errors. Automatic, computer vision-based localization approaches exist, including [5,6,7,8,9], which rely on determining the epipolar geometry between pairs of cameras with overlapping views; but, while accurate, the difficulty inherent to computer vision-based localization techniques is the requirement of detecting and correlating a large number of world feature points common in the views of multiple cameras. Determining these point correspondences opportunistically requires extensive data categorization [12] and message passing, beyond the capabilities of resource-constrained smart camera sensor platforms [14]. Our solution directly addresses the point correspondence problem with a feature-point-filled and efficiently detectable 3D localization target. Another key advantage to using a 3D localization target is that using its detected feature points to determine a camera's projection matrix gives the camera's position and orientation in the 3D coordinate frame defined by the target's geometry. Not only does this allow for a meaningful and common metric to be applied while localizing the network, but it also simplifies alignment of all cameras to the same coordinate frame. This is because any two cameras that localize to the same target position are automatically localized to the target's geometry.

Localizing an entire view-connected smart camera network requires moving the target through the overlapping views of all pairs of cameras. Because accuracy is maintained when the target appears small in frame, the necessary degree of overlap is small. We evaluate our solution in a real network using the Panoptes embedded video sensors [14], consisting of low-cost webcams and the PDA-class Stargate processing platform. Our results show that our solution has the same level of accuracy as epipolar geometry-based methods, but requires both less computation and less message passing.

2. RELATED WORK
Automated methods to localize sensor networks typically require a source of range information in order to triangulate node positions and orientations. Non-camera-equipped networks, often consisting of resource-constrained scalar sensors, can measure ranges from ultrasound, radio, or acoustic signals [10]. Camera-equipped networks, to which our localization solution applies, can infer ranges from visually gathered information. Visual information can be of two types: 1) motion, which can be tracked to infer an object's trajectory and thereby probabilistically identify and correlate targets in different camera views, or 2) detectable static world feature points observed in overlapping fields of view of cameras and gathered either opportunistically or from deliberately placed identifiable markers. Two solutions that utilize motion tracking are [4,10]. Both maintain a joint distribution over trajectories and camera positions, but only in 2D. While the results presented in [4] are restricted to 2D, [10] produces a 3D result by pre-computing camera translations and orientations to a common ground plane in which all motion is identified and tracked. Solutions utilizing static feature point detection can localize in 3D with no prior knowledge of camera deployment. Table 1 provides a comparative overview of five previously proposed solutions [5,6,7,8,9] and our own solution.

Localization methods that use static feature point detection and correlation require a minimum of pairwise overlapping views and a fully view-connected network. [5,7] require a minimum of triples of cameras with shared views. Also, the previously proposed solutions in Table 1 rely on essential matrix estimation to determine the epipolar geometry between pairs of cameras, and thereby deduce their relative poses ([5] estimates projection matrices from the determined epipolar geometry). But because essential matrix estimation requires correlated sets of image points of commonly detected features whose 3D world coordinates are unknown, the scale of the geometric relationship between the cameras cannot be determined. [7,8] each propose the use of a calibration target consisting of a bar of known length with LEDs at each end; the known 3D length between the detected lights serves as a constraint in determining scale. Lacking a target, opportunistic feature point detection using SIFT [12] must be employed, which requires extensive image processing and messaging to correlate image points across views. Also, because scale is unknown in each pairwise essential matrix estimation, each must be realigned to a common, but still unknown, scale using a centrally processed bundle adjustment over all camera parameters and triangulated world feature points.

Our solution, utilizing a 3D localization target, seeks to minimize the cost of feature point detection and inherently provide scale. The image points of the target's feature points are easily and robustly determined by a simple and efficient detection algorithm. Because the geometry of the target is known and used for projection matrix estimation, camera position and orientation is always given in the target's coordinate frame.
Table 1. Comparison of feature point correlation-based smart camera network localization methods

Devarajan et al. [5]
- Complexity: structure from motion; estimates projection matrices for triples of cameras.
- Algorithm: nodes correlate common inlier feature points to form a vision graph; the graph guides clustering (min = 3), where each node in a cluster estimates all nodes' projection matrices using SFM and refines them with bundle adjustment; a second bundle adjustment refines camera parameters and feature points, including reconstructed feature points not in the original set of inliers.
- Assumes: nothing.
- Deployment constraints: multiple triples (or more) of cameras with shared fields of view.
- Message passing: SIFT-categorized feature points to neighbors.
- Feature point detection: opportunistic, using SIFT and RANSAC to detect highly reliable inliers.
- Accuracy: simulated 45 cm error per camera at 220 m scene width at 1 pixel noise; orientation error not clear; accuracy of actual deployment shown only as an accurate wireframe reconstruction.
- Issues: requires multiple camera overlaps; actual test contains a set of 12 overlapping cameras.

Lymberopoulos et al. [6]
- Complexity: epipolar estimation; iterative refinement via pre-known constraints.
- Algorithm: pairwise cameras compute epipoles and orientation from direct epipole observation or fundamental matrix estimation; known lengths between some nodes provide scale and constrain refinement of unknown lengths; iterates until convergence of localizations.
- Assumes: known lengths between nodes.
- Deployment constraints: views overlap; some cameras see each other.
- Message passing: estimated distances between nodes to neighbors.
- Feature point detection: LEDs on camera and non-camera nodes.
- Accuracy: actual 60 cm error at 297 cm average node-to-node length; 20 cm if all epipoles observed; no orientation error given.
- Issues: best when camera nodes see and are able to detect each other.

Mantzel et al. [7]
- Complexity: iterative epipolar and projection matrix (re)estimation.
- Algorithm: pairwise localized nodes triangulate 3D world points to provide to unlocalized neighbors, who then localize from projection matrix estimation; more localized cameras cause more pairwise localizations and re-triangulated 3D world points, which cause re-localization of individual cameras.
- Assumes: some camera positions and orientations known.
- Deployment constraints: multiple triples of cameras with shared fields of view.
- Message passing: initial and revised localization estimates.
- Feature point detection: unspecified; suggests opportunistic detection via motion tracking.
- Accuracy: no actual deployment evaluated; simulated error of 0.25% of deployment area diameter; not stated how error applies to orientation.
- Issues: particular method suited to a 3D target, but none tested.

Kurillo et al. [8]
- Complexity: epipolar with centralized sparse bundle adjustment.
- Algorithm: target provides image point correspondences for pairwise essential matrix estimation; scale rectified by the target's known dimension; shortest paths on a vision graph guide realignment of pairwise localizations to the global coordinate frame; centrally refined with sparse bundle adjustment.
- Assumes: precalibrated cameras.
- Deployment constraints: pairwise overlapping views; time-synchronized nodes.
- Message passing: initial and refined coordinates of observed 3D points.
- Feature point detection: opportunistic detection of a known-length bar with an LED on each end.
- Accuracy: simulated 0.2% position error (RMSE between estimated and actual) at noise < 0.6 pixel; small reprojection error stated for actual test.
- Issues: accuracy greatly improved if the target spans 1/3 of the frame width.

Medeiros et al. [9]
- Complexity: iterative epipolar estimation, distributed refinement.
- Algorithm: target provides image point correspondences for pairwise essential matrix estimation; scale rectified by the target's known dimension; a reference index algorithm guides realignment of pairwise localizations to the global coordinate frame; a distributed weighted recursive least squares technique updates localizations when either new target point acquisitions allow re-pairwise calibration or a neighbor provides an updated localization estimate.
- Assumes: precalibrated cameras.
- Deployment constraints: pairwise overlapping views; time-synchronized nodes.
- Message passing: detected 3D points; neighbors' relative position when updated.
- Feature point detection: opportunistic detection of a known-length bar with an LED on each end.
- Accuracy: simulated positional error of about 1 inch with no noise; very accurate orientations; no actual deployment evaluated.
- Issues: benefits from multiple target passes, as more 3D points trigger updates to localizations, which trigger more refinement.

Our solution
- Complexity: pairwise projection matrix estimation.
- Algorithm: an easily detectable 3D target of known geometry provides image-to-world point correspondences for pairwise projection matrix estimation and a common metric and scale for all pairwise localizations; the target's path through the network guides realignment of pairwise localizations to the global coordinate frame.
- Assumes: precalibrated cameras.
- Deployment constraints: pairwise overlapping views; time-synchronized nodes.
- Message passing: transformation to the global frame, passed to the pairwise neighbor.
- Feature point detection: opportunistic detection of the 3D target.
- Accuracy: actual position and orientation error < .01% for pairwise localizations; maximum global error < .01n% at n hops from the first camera localized.
- Issues: propagates single-camera localization errors to later pairwise localizations.
3. ALGORITHM

3.1 The Projection Matrix
The essential matrix expresses epipolar geometry between two cameras and can be estimated from correlated sets of image points of world features common in both cameras' views. The projection matrix expresses how one camera projects world coordinates in an independent 3D coordinate frame to 2D pixel points. This projection transforms between four different coordinate frames, shown in Figure 1. 3D world coordinates in an arbitrary world coordinate frame (WCF) are translated and rotated to a camera-centric coordinate frame (CCF) whose origin is the camera's point of perspectivity. Using homogeneous coordinates for 3D points, and where R is a 3D rotation matrix and C a 3D translation vector:

\begin{bmatrix} X_{CCF} \\ 1 \end{bmatrix} = \begin{bmatrix} R & -RC \\ 0_3^T & 1 \end{bmatrix} \begin{bmatrix} X_{WCF} \\ 1 \end{bmatrix}

Next, 3D CCF points are projected to 2D points in the image coordinate frame (ICF). The projection is a scaling of 3D CCF points by f/Z along their ray through the origin, where f is the camera's focal length; the image plane is at a distance f from the CCF's origin.

\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}_{CCF} = \left( f X_{CCF},\; f Y_{CCF},\; Z_{CCF} \right)^T \equiv \left( \frac{f X_{CCF}}{Z_{CCF}},\; \frac{f Y_{CCF}}{Z_{CCF}} \right)^T

The 3D scaled point is considered a 2D homogeneous point and converted to inhomogeneous coordinates.

Finally, 2D ICF points are translated to 2D pixel coordinate frame (PCF) points. This is a transformation from a right-handed coordinate system to the traditional left-handed PCF, which has its origin in the top left corner of a frame. If (x0, y0) are the coordinates of the ICF origin in the PCF:

u = x + x0 and v = y + y0

The projection matrix combines all transformations into a 3x4 matrix:

P = \begin{bmatrix} f & 0 & x_0 \\ 0 & f & y_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & -RC \\ 0_3^T & 1 \end{bmatrix} = K R \, [\, I_{3\times3} \mid -C \,]

K is referred to as the camera calibration matrix and contains the camera's 5 intrinsic parameters. R and C are the camera's 6 extrinsic parameters, which yield the camera's orientation and position in the WCF.

Figure 1. Projection of 3D points to 2D pixel points

3.2 Camera Position and Orientation
P can be estimated from a correlated set of known 3D world points and their 2D pixel points. Reportedly, 28 point correlations are sufficient, but due to noise in real-world feature point detections, using more correlations gives better results. The point coordinate values are used in an over-determined system of linear equations that is solved with the singular value decomposition. The left 3x3 submatrix of the estimate of P can be decomposed into KR using the RQ decomposition [3]. Because [I3x3 | -C]C = 0, C is determined from the null space of P.

We use Levenberg-Marquardt to minimize projection error. Projection error is computed as the distance between a detected feature point and its projection by the estimate of P. We both reduce the cost of minimization and achieve more accurate results by using pre-computed intrinsic and lens distortion parameters [3] in the evaluation of the error function, rather than the parameters obtained from the decomposition of P.
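The two steps above can be made concrete with a short sketch. This is our own illustration rather than code from the paper: it assumes NumPy and SciPy, uses made-up intrinsic values and synthetic correspondences, requires at least six 3D-to-2D point matches (the paper reports using roughly 28), and omits the Levenberg-Marquardt refinement and lens distortion handling described above.

```python
import numpy as np
from scipy.linalg import rq

def projection_matrix(f, x0, y0, R, C):
    """Compose P = K R [I | -C] from focal length, principal point, and pose."""
    K = np.array([[f, 0.0, x0],
                  [0.0, f, y0],
                  [0.0, 0.0, 1.0]])
    return K @ R @ np.hstack([np.eye(3), -C.reshape(3, 1)])

def project(P, X_wcf):
    """Project a 3D target-frame point to inhomogeneous pixel coordinates."""
    x = P @ np.append(X_wcf, 1.0)        # homogeneous pixel point
    return x[:2] / x[2]                  # divide out the Z_CCF scale

def estimate_projection_matrix(X, x):
    """DLT estimate of P from n >= 6 correspondences.
    X: (n, 3) known 3D target feature points; x: (n, 2) detected pixel points."""
    rows = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Xh = [Xw, Yw, Zw, 1.0]
        rows.append([0.0] * 4 + [-c for c in Xh] + [v * c for c in Xh])
        rows.append(Xh + [0.0] * 4 + [-u * c for c in Xh])
    A = np.asarray(rows)                 # (2n, 12) over-determined linear system
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)          # right singular vector of smallest sigma

def decompose(P):
    """Recover K, R, and the camera center C from an estimate of P."""
    _, _, Vt = np.linalg.svd(P)
    Ch = Vt[-1]                          # null space of P, i.e. P @ Ch = 0
    C = Ch[:3] / Ch[3]
    if np.linalg.det(P[:, :3]) < 0:      # the DLT fixes P only up to sign
        P = -P
    K, R = rq(P[:, :3])                  # RQ decomposition of the left 3x3 block
    S = np.diag(np.sign(np.diag(K)))     # force a positive diagonal on K
    K, R = K @ S, S @ R
    return K / K[2, 2], R, C

# Illustrative round trip: identity orientation, camera 2 m behind the target
# frame origin, 640x480-style intrinsics, 28 synthetic target points.
R_true, C_true = np.eye(3), np.array([0.0, 0.0, -2.0])
P_true = projection_matrix(800.0, 320.0, 240.0, R_true, C_true)
X = np.random.uniform(-0.2, 0.2, size=(28, 3))
x = np.array([project(P_true, Xi) for Xi in X])
K_est, R_est, C_est = decompose(estimate_projection_matrix(X, x))
print(np.round(C_est, 3))                # ~ [0, 0, -2]
```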
3.3 Network Localization
The rotation matrix and translation vector decomposed from an estimate of P give the camera's position and orientation in the 3D world coordinate frame defined by the geometry of the localization target. Thus, any 2 or more cameras that localize to the same target position are automatically localized in the same coordinate frame. The 2 (or more) cameras' relative positions and poses are determined without having computed an essential matrix.

Globally localizing an entire view-connected network in the same 3D coordinate frame requires subsequently positioning the localization target in the view of all pairs of cameras. This movement can be automated. When the target appears simultaneously to two unlocalized cameras, each localizes to the target's current coordinate frame by estimating P after detecting the target's feature points. When the target later appears simultaneously in the view of an already localized camera and an as-yet unlocalized camera, again each localizes to the target's current position, but then the camera with 2 different localizations computes and passes to the other a rotation and translation that realigns the current target coordinate frame to the prior one. This has the effect of bringing the newly localized camera into the same coordinate frame as that shared by the other camera and its previous pairwise partner. The realignment of cameras to previous coordinate frames can occur either in a linear fashion as the target moves through the network, or it can be done in a more strategic way; for example, after computing all pairwise localizations, realignment can begin from the camera pair that has the shortest path to the leaves in a vision graph of the network [5,8]. Because error from single camera localizations propagates with the realignments, as will be discussed in Section 4, the latter approach is highly advised.
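One way to read this realignment step concretely is the following sketch. It is our own bookkeeping under the pose convention used above (a camera maps world points as X_CCF = R (X_WCF - C)), assumes NumPy, and uses variable names of our choosing; the paper describes the transformation only in prose. Since the doubly localized camera is a single fixed device, its two poses, one in the prior target frame A and one in the current frame B, determine the rigid transform from B to A, which can then be applied to the newly localized neighbor's pose.

```python
import numpy as np

def frame_change(R_a, C_a, R_b, C_b):
    """From one camera's two localizations -- (R_a, C_a) in prior target frame A
    and (R_b, C_b) in current frame B -- return (R_ab, t_ab) such that
    X_A = R_ab @ X_B + t_ab."""
    R_ab = R_a.T @ R_b                   # both poses describe the same physical camera
    t_ab = C_a - R_ab @ C_b
    return R_ab, t_ab

def realign_pose(R_cam, C_cam, R_ab, t_ab):
    """Re-express a camera pose that was estimated in frame B in frame A."""
    return R_cam @ R_ab.T, R_ab @ C_cam + t_ab

# The doubly localized camera computes (R_ab, t_ab) and passes it to its newly
# localized partner, which joins the shared frame via:
# R_new, C_new = realign_pose(R_neighbor_B, C_neighbor_B, R_ab, t_ab)
```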
3.4 The Localization Target
Due to the widely varying environmental and lighting conditions in possible smart camera network deployments, as well as variations in camera quality, subject size, and baselines between cameras, it is unreasonable to expect that any one 3D localization target will be suitable for all networks. Rather, a target should be designed specific to the deployment environment and purpose. To demonstrate the practicality of using a 3D localization target, we have designed and created a small target with 288 feature points set across 6 differently angled grids. We have also designed and implemented a simple and efficient detection algorithm.

Detection of our target (in a 640x480 image) begins by stepping through the image to find a green pixel. Then: find all contiguous green pixels on the row; from that line's midpoint, find all contiguous green pixels on the column; consider this vertical line to be the vertical diameter of the sphere atop the target; use the midpoint as a starting reference for finding all grid-side edges of the colored areas beside each grid. These edges define target-relative horizontal and vertical lines that bound a grid and define scan line length and orientation for finding edge fits to all sides of the squares in the grid. Intersecting the edge fits gives the corners of the squares, which are the feature points of the target, shown in Figure 2. Our results are generated with the target upright for ground truth measurement purposes, but the detection algorithm does not require it to be so. We have also verified that detection functions well under various lighting conditions.
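The paper gives the detector only as prose; the following is a rough sketch of just its first stage, locating the green sphere that anchors the rest of the scan. It assumes NumPy, an RGB image array, and an illustrative is_green threshold of our own choosing; the subsequent grid-edge fitting is omitted.

```python
import numpy as np

def is_green(px, margin=40):
    """Illustrative color test: the green channel dominates red and blue."""
    r, g, b = int(px[0]), int(px[1]), int(px[2])
    return g > r + margin and g > b + margin

def find_sphere_diameter(img, step=4):
    """Step through the image for a green pixel, expand it to a horizontal run,
    then expand the run's midpoint column into the sphere's vertical diameter."""
    h, w, _ = img.shape
    for y in range(0, h, step):
        for x in range(0, w, step):
            if not is_green(img[y, x]):
                continue
            left, right = x, x
            while left > 0 and is_green(img[y, left - 1]):
                left -= 1
            while right < w - 1 and is_green(img[y, right + 1]):
                right += 1
            mid = (left + right) // 2
            top, bottom = y, y
            while top > 0 and is_green(img[top - 1, mid]):
                top -= 1
            while bottom < h - 1 and is_green(img[bottom + 1, mid]):
                bottom += 1
            return mid, top, bottom      # column and vertical extent of the sphere
    return None
```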
Figure 2. Detecting the 3D localization target's feature points

Figures 3 and 4: Accuracy of single camera localizations

4. EVALUATION
4.1 Single Camera Localization
Our algorithm’s accuracy is dependent upon the accuracy of
the projection matrix estimation by individual cameras upon
observing the target. This is because each camera computes
and passes to at least one neighbor a transformation between
two target coordinate frames it has localized to. Any error in
these localizations is propagated through the network via the
passed transformations.
Figures 3 and 4 demonstrate the accuracy of single camera
localizations using our indoor localization target. Figure 3
shows the position error of a single camera’s localization as
the target is placed successively further from the camera.
Position error is defined as the percentage of the Euclidean
error of the camera position estimation to the camera-to-
target distance. Percentage of frame area occupied by point
matches means the percentage of the area, in square pixels,
of the bounding box around all target feature points to the
total image area. The graph shows that the localization error observed when the target is close to the camera remains the same at subsequent, more distant target positions; the error percentage therefore decreases as the camera-to-target distance grows. Because the error is consistent, it is likely due to an inaccuracy in manual measurement. Figure 4
shows that estimated orientation angles fluctuate less than
0.3 of a degree over the same configuration.
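For concreteness, the two metrics plotted in Figures 3 and 4 can be restated as the small helpers below. This is our own restatement, assuming NumPy; C_est, C_true, and target_center are 3-vectors in the target's coordinate frame, and points_px is an (n, 2) array of detected pixel coordinates.

```python
import numpy as np

def position_error_pct(C_est, C_true, target_center):
    """Euclidean position error as a percentage of the camera-to-target distance."""
    return 100.0 * np.linalg.norm(C_est - C_true) / np.linalg.norm(C_true - target_center)

def frame_area_pct(points_px, width=640, height=480):
    """Bounding-box area of the detected feature points as a percentage of image area."""
    (u_min, v_min), (u_max, v_max) = points_px.min(axis=0), points_px.max(axis=0)
    return 100.0 * (u_max - u_min) * (v_max - v_min) / (width * height)
```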
4.2 Network Localization
Figure 5 shows the position error of cameras realigned to the
network’s global coordinate frame. Due to the propagation
of single camera errors in transformations passed to
neighbors, the error increases at each hop away from the
camera chosen as origin of the global coordinate frame.
Figures 5 and 6: Accuracy of network localization

Figure 6 shows the change in the estimate of the z-axis orientation at each hop. This error is consistently higher than the error in the other orientation estimates, and may be due to the fact that in our setup the z-coordinate value of target positions was much greater than the other two.

4.3 Message Passing
Message passing in our solution consists only of determining simultaneous target detections between pairs of cameras, and the passing of realignment transformations. Because projection matrix estimation occurs between the target and one camera, there is no need to pass or correlate detected feature point sets between pairs of cameras.

5. FUTURE WORK
Due to the uncertainty that manual measurement errors cast over our real deployment results, we are implementing a simulator. We will also implement a centralized bundle adjustment for comparison purposes, as well as the use of local pairwise bundle adjustments, although both would increase message passing if adopted into the solution.

6. CONCLUSION
We have presented a new solution for smart camera network localization in 3D that addresses both the point correspondence problem and the high amount of processing required in epipolar geometry-based computer vision localization algorithms. Our solution also addresses the unknown scale issue inherent in using epipolar geometry to determine relative pose between cameras. Recent epipolar geometry-based solutions [8,9] propose the use of a simple 2D calibration target to resolve the scale issue. Our solution takes the next step of a full-featured 3D target that not only resolves scale, but also reduces both message passing and computation.
7. ACKNOWLEDGEMENTS
This material is based upon work supported by the National
Science Foundation under Grant No. CNS-0722063. Any
opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do not
necessarily reflect the views of the National Science
Foundation.
8. REFERENCES
[1] Hartley, R. and Zisserman, A. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
[2] Faugeras, O. The Geometry of Multiple Images. MIT Press, 2004.
[3] Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 11 (Nov 2000), 1330-1334.
[4] Funiak, S., Guestrin, C., Paskin, M., and Sukthankar, R. Distributed localization of networked cameras. Information Processing In Sensor Networks (Apr 2006), 34-42.
[5] Devarajan, D., Radke, R., and Chung, H. Distributed Metric Calibration of Ad-Hoc Camera Networks. ACM Transactions on Sensor Networks, 2, 3 (Aug 2006), 380-403.
[6] Lymberopoulos, D., Barton-Sweeny, A., and Savvides, A. Sensor localization and camera calibration using low power cameras. ENALAB Technical Report 090105, September 2005.
[7] Mantzel, W., Choi, H., and Baraniuk, R.G. Distributed camera network localization. 38th Asilomar Conference on Signals, Systems and Computers (Nov 2004).
[8] Kurillo, G., Li, Z., and Bajcsy, R. Wide-area external multi-camera calibration using vision graphs and virtual calibration target. Second ACM/IEEE International Conference on Distributed Smart Cameras (Sep 2008).
[9] Medeiros, H., Iwaki, H., and Park, J. Online distributed calibration of a large network of wireless cameras using dynamic clustering. Second ACM/IEEE International Conference on Distributed Smart Cameras (Sep 2008).
[10] Taylor, C., Rahimi, A., Bachrach, J., Shrobe, H., and Grue, A. Simultaneous localization, calibration, and tracking in an ad hoc sensor network. Information Processing In Sensor Networks (2006), 27-33.
[11] Rahimi, A., Dunagan, B., and Darrell, T. Simultaneous calibration and tracking with a network of non-overlapping sensors. Computer Vision and Pattern Recognition, 1 (Jul 2004), 187-194.
[12] Lowe, D. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60, 2 (2004), 91-110.
[13] Bindel, D., Demmel, J., and Kahan, W. On computing Givens rotations reliably and efficiently. ACM Transactions on Mathematical Software, 28, 2 (Jun 2002), 206-238.
[14] Feng, W., Code, B., Kaiser, E., Shea, M., and Feng, W. Panoptes: scalable low-power video sensor networking technologies. ACM Multimedia (2003), 90-91.
