Hand and Stereo Vision
Abstract
A maximum a posteriori probability zero disparity filter (MAP ZDF) ensures coordinated stereo fixation upon an arbitrarily moving, rotating, re-configuring hand, performing marker-less pixel-wise segmentation of the hand. Active stereo fixation permits real-time foveal hand tracking and segmentation over a large visual workspace, allowing investigation of unrestricted natural human gesturing. Hand segmentation is shown to be robust to lighting conditions, defocus, hand colour variation, foreground and background clutter including non-tracked hands, and partial or gross occlusions including those due to non-tracked hands. The system operates at approximately 27 fps on a 3 GHz single-processor PC.
Key words: Active Stereo Vision, Zero Disparity Filter, Markov Random Field,
Hand Segmentation and Tracking, Human-Computer Interaction
PACS: 01.30.-y
1 Introduction
Humans interact with each other efficiently using mutually understood words,
gestures and actions. Intelligent artificial systems that gather information from such natural human cues could support similarly efficient human-computer interaction.
When tracking objects such as hands under real-world conditions, three main
problems are encountered: ambiguity, occlusion and motion discontinuity. Am-
biguities arise due to distracting noise, mismatching of the tracked objects
and the potential for multiple hands, or hand-like distractors, to overlap the
tracked target. Occlusions are inevitable in realistic scenarios where the sub-
ject interacts with the environment. Certainly, in dynamic scenes, the line of
sight between the cameras and the target is not always guaranteed. At usual frame rates (approximately 30 fps), the motion of a dexterous subject such as a hand can appear erratic or discontinuous, and motion models designed for tracking such subjects
may be inadequate.
Existing methods for marker-less visual hand tracking can be categorised ac-
cording to the measurements and models they incorporate [14]. Regardless
of the approach, hand gesture recognition usually requires a final verification
step to match a model to observations of the scene.
1.1.1 Cue-Based Methods
Mean Shift and Cam Shift methods are enhanced manifestations of cue mea-
surement techniques that rely on colour chrominance based tracking. For real-
time performance, a single channel (chrominance) is usually considered in the
colour model. This heuristic is based on the assumption that skin has a uniform chrominance. Such trackers compute the probability that any given pixel value corresponds to the target colour. Difficulty arises where the assumption
of a single chrominance cannot be made. In particular, the algorithms may fail
to track multi-hued objects or objects where chrominance alone cannot allow
the object to be distinguished from the background, or other objects.
The Mean Shift algorithm is a non-parametric technique that ascends the gra-
dient of a probability distribution to find the mode of the distribution [13,10].
Particle filtering based on colour distributions and Mean Shift was pioneered
by Isard and Blake [18] and extended by Nummiaro et al. [28]. Cam Shift was
initially devised to perform efficient head and face tracking [8]. It is based on
an adaptation of Mean Shift where the mode of the probability distribution is
determined by iterating in the direction of maximum increase in probability
density. The primary difference between the Cam Shift and the Mean Shift
algorithms is that Cam Shift uses continuously adaptive probability distribu-
tions (recomputed each frame) while Mean Shift is based on static distribu-
tions. More recently, Shen has developed Annealed Mean Shift to counter the
tendency for Mean Shift trackers to settle at local rather than global maxima
[37].
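For illustration only (this is not the method developed in this paper), a minimal Cam Shift tracker of the kind described above can be realised with OpenCV; the initial window, hue-histogram range and termination criteria below are assumed values.

import cv2
import numpy as np

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
track_window = (200, 150, 80, 80)            # assumed initial hand region (x, y, w, h)
x, y, w, h = track_window
hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
# Single-channel (hue) histogram: the uniform-chrominance assumption noted above.
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # Cam Shift recomputes the window (and orientation) from the back-projection each frame.
    rot_rect, track_window = cv2.CamShift(back_proj, track_window, term_crit)
    pts = np.intp(cv2.boxPoints(rot_rect))
    cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
    cv2.imshow("camshift", frame)
    if cv2.waitKey(30) == 27:
        break

As discussed above, such a tracker relies entirely on the target chrominance model, which is precisely the limitation the present work avoids.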
Such cue-based trackers operate in the image plane rather than considering a target in a 3D scene, and are not inherently intended to deal with
occlusions and other ambiguous tracking cases (for example, a tracked target
passing in front of a visually similar distractor). In such circumstances, these
trackers may shift between alternate subjects, focus on the center of gravity
of the two subjects, or track the distracting object rather than the intended
target. To alleviate this, motion models and classifiers can be incorporated,
but they may rely upon weak and restrictive assumptions regarding target
motion and appearance.
Methods exist that do not require a priori models or target knowledge. In-
stead, the target is segmented using an un-calibrated semi-spatial response
by detecting regions in images or cue maps that appear at the same pixel
coordinates in the left and right stereo pairs. That is, regions that are at
zero disparity 2 . To overcome pixel matching errors associated with gain dif-
ferences between left and right views, these methods traditionally attempt to
align vertical edges and/or feature points.
Unfortunately, these methods do not cope well with bland subjects or back-
grounds, and perform best when matching textured sites and features on tex-
tured backgrounds. The zero disparity class of segmentation forms the base
upon which we develop our approach.
1.2 Overview
We aim to ensure coordinated active stereo fixation upon a hand target, and to
facilitate its robust pixel-wise extraction. We propose a biologically inspired,
conceptually simple method that segments and tracks the subject in paral-
lel, eliminating problems associated with the separation of segmentation and
tracking. The method inherently incorporates spatial considerations to disam-
biguate between, for example, multiple overlapping hands in the scene such
that occlusions or distractions induced by non-tracked hands do not affect
tracking of the selected hand. As we shall see, the method does not rely on
imposing motion models on the commonly erratic trajectory of a hand, and
can cope with gross partial occlusions. In this regard, the three common problems of ambiguity, occlusion and motion discontinuity are addressed. Despite using stereo vision, the approach does not require stereo camera calibration, intrinsic or extrinsic. The method utilises dynamic stereo foveal scene analysis, and we choose an active implementation that has the benefit of increasing the volume of the visual workspace.

2 A scene point is at zero disparity if it exists at the same image frame coordinates in the left and right images.
3 The horopter is the locus of zero disparity scene points that would project to identical left and right image coordinates if that scene point were occupied by a visual surface.

Fig. 1. CeDAR, active vision apparatus.
2 Platform
CeDAR (Fig. 1), the Cable-Drive Active-Vision Robot [40], is the experimental apparatus. It incorporates a common tilt axis and two independent pan axes separated by a baseline of 30 cm. All axes exhibit a range of motion greater than 90°, speeds greater than 600°/s and an angular resolution of 0.01°. Synchronised images with a field of view of 45° are obtained from each camera at 30 fps at a resolution of 640×480 pixels. Images are down-sampled to 320×240 resolution before hand tracking processing occurs.
Fig. 2. Scanning the horopter over the scene: the locus of zero disparity points
defines a plane known as the horopter. For a given camera geometry, searching for
pixel matches between left and right stereo images over a small disparity range
defines a volume about the horopter. By varying the geometry, this measurable
volume can be scanned over the scene. In the first frame, only the circle lies within
the searchable region. As the horopter is scanned outwards by varying the vergence
point, the triangle, then the cube become detectable, and their spatial location
becomes known.
3 Active Vision

A vision system able to adjust its visual parameters to aid task-oriented behav-
ior – an approach labeled active [2] or animate [5] vision – can be advantageous
for scene analysis in realistic environments [4]. In terms of spatial (depth) anal-
ysis, rather than obtaining a depth map over a large disparity range (as per
static depth mapping), active vision allows us to consider only points at or
near zero disparity for a given camera geometry. Then, by actively varying
the camera geometry, it is possible to place the horopter and/or fixation point
over any of the locations of interest in a scene and thereby obtain relative
local spatial information about those regions. Where a subject is moving, the
horopter can be made to follow the subject. By scanning the horopter over
the scene, we increase the volume of the scene that may be measured. Fig. 2
shows how the horopter can be scanned over the scene by varying the camera
geometry for a stereo configuration. This approach is potentially more efficient
than static spatial methods because a small (or zero) disparity search scanned
over the scene is less computationally expensive than a large and un-scannable
disparity search from a static configuration.
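As a rough geometric sketch of this scanning (assuming symmetric vergence about the mid-line and ignoring lens distortion; the total vergence angle θ and fixation distance Z_f are symbols introduced here purely for illustration), the distance at which the horopter crosses the optical mid-line is

$$Z_f \approx \frac{b}{2\,\tan(\theta/2)},$$

so for the 30 cm baseline of Section 2, reducing θ sweeps the measurable zero disparity volume from near the cameras out towards the far scene, as depicted in Fig. 2.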
Foveal systems are able to align the region around the centre of the image
(where more resources are allocated for processing) with a region of interest
in the scene such that attention can be maintained upon a subject. Active sys-
tems increase the visual workspace while maintaining high subject resolution
and maintaining a consistent level of computation. Indeed, much success has
come from studying the benefits of active vision systems [35]. Alternatively,
pseudo-active configurations are possible where either fixed cameras use hori-
zontal pixel shifting of the entire images to simulate horopter reconfiguration,
or where the fovea is permitted to shift within the image. Although feasible
for the operations presented herein, relying on such virtual horopter shifting
of the entire images reduces the useful width of the images by the number
of pixels of shift and dramatically decreases the size of the visual workspace.
Target contextual information is also reduced where a target moves away from
the optical centers of the static cameras such that its surroundings cannot be
seen. Valuable processing time could also be consumed in conducting whole-
image shifts or in re-configuring the fovea position. Both virtual horopter and
virtual fovea approaches are simply methods to simulate true active stereo
vision.
For humans, the boundaries of an object upon which we have fixated emerge
effortlessly because the object is centred and appears with similar retinal cov-
erage in our left and right eyes, whereas the rest of the scene usually does
not. For synthetic vision, the approach is the same. The object upon which
fixation has occurred will appear with identical pixel coordinates in the left
and right images, that is, it will have zero disparity. For a pair of cameras with
suitably similar intrinsic parameters, this condition does not require epipolar
or barrel distortion rectification of the images. Camera calibration, intrinsic
or extrinsic, is not required.
4 Hand Segmentation
We begin by assuming short baseline stereo fixation upon the hand. A zero
disparity filter (ZDF) is formulated to identify the projection of the hand
as it maps to identical image frame pixel coordinates in the left and right
foveas. Fig. 7 shows example ZDF output. Simply comparing the intensities
of pixels in the left and right images at the same coordinates is not adequate
due to inconsistencies in (for example) saturation, contrast and intensity gains
between the two cameras, as well as focus differences and noise.
A human can easily distinguish the boundaries of the object upon which fixa-
tion has occurred even if one eye looks through a tinted lens. Accordingly, the
regime should be robust enough to cope with these types of inconsistencies.
One approach is to apply normalised cross-correlation (NCC) between small templates in one image and templates at the same locations in the other image. The NCC
function is shown in Eq.1:
$$NCC(I_1, I_2) = \frac{\sum_{(u,v)\in W} I_1(u, v)\, I_2(x+u,\, y+v)}{\sqrt{\sum_{(u,v)\in W} I_1^2(u, v)\; \sum_{(u,v)\in W} I_2^2(x+u,\, y+v)}}, \qquad (1)$$
where I1, I2 are the compared left and right image templates over window W and (u, v) are coordinates within the template. Fig. 3 shows the output of this approach.

Fig. 3. NCC of 3×3 pixel regions at the same coordinates in the left and right images. Correlation results with higher values are shown brighter.
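A minimal sketch of this per-site comparison (Eq. 1 evaluated at every pixel under the zero disparity hypothesis), in Python/NumPy, assuming the left and right foveas are already extracted as equally sized grayscale arrays; border and degenerate sites are left at 0.5, matching the bland-area suppression described below:

import numpy as np

def ncc_map(left, right, win=3, eps=1e-6):
    """Eq. 1 at every pixel: NCC of win x win patches taken at the SAME
    coordinates in the left and right foveas (zero disparity hypothesis)."""
    r = win // 2
    h, w = left.shape
    out = np.full((h, w), 0.5, dtype=np.float32)   # bland/undefined sites -> 0.5
    L = left.astype(np.float32)
    R = right.astype(np.float32)
    for y in range(r, h - r):
        for x in range(r, w - r):
            p = L[y - r:y + r + 1, x - r:x + r + 1]
            q = R[y - r:y + r + 1, x - r:x + r + 1]
            denom = np.sqrt((p * p).sum() * (q * q).sum())
            if denom > eps:
                out[y, x] = (p * q).sum() / denom
    return out

For a 60×60 fovea this brute-force loop is already cheap; a production version would vectorise the summations with box filters.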
Bland areas in the images have been suppressed (set to 0.5) using difference of
Gaussians 4 (DOG) pre-processing. The 2D DOG kernel is constructed using
symmetric separable 1D convolutions. The 1D DOG function is shown in Eq.2:
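Assuming the conventional difference of two unit-area Gaussians with centre scale σ1 and surround scale σ2 (σ1 < σ2; the precise normalisation is an assumption here), this takes the form

$$DOG(x) = \frac{1}{\sigma_1\sqrt{2\pi}}\,e^{-x^2/2\sigma_1^2} \;-\; \frac{1}{\sigma_2\sqrt{2\pi}}\,e^{-x^2/2\sigma_2^2}. \qquad (2)$$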
It is also desirable that surfaces that are not perpendicular to the camera optical axes – but appear visually similar to the dominantly zero disparity region – can be segmented as the same object.
For these reasons, we adopt a Markov Random Field [16] (MRF) approach.
The MRF formulation specifies that the value of a random variable at the set of sites (pixel locations) S depends on the random variable configuration field f (labels at all sites) only through its neighbours N ⊂ S. For a ZDF, the set of possible labels at any pixel in the configuration field is binary, that is, sites can take either the label zero disparity (f(S) = lz) or non-zero disparity (f(S) = lnz). For an observation O (in this case an image pair), Bayes' law states that the a posteriori probability P(f | O) of field configuration f is proportional to the product of the likelihood P(O | f) of that field configuration given the observation and the prior probability P(f) of realisation of that configuration:
$$P(f \mid O) \propto P(O \mid f) \cdot P(f). \qquad (4)$$
In accordance with [7], we assign clique potentials using the Generalised Potts
Model where clique potentials resemble a well with depth u:
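A sketch in the conventional Generalised Potts form (assumed here) is

$$V_{p,q}(f_p, f_q) = u\,\bigl(1 - \delta(f_p - f_q)\bigr),$$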
where δ is the unit impulse function. Clique potentials are isotropic (V_{p,q} = V_{q,p}), so P(f) reduces to:

$$P(f) \propto \exp\left(-\sum_{\{p,q\}\in\varepsilon_N}\begin{cases}2u & f_p \neq f_q,\\ 0 & \text{otherwise.}\end{cases}\right) \qquad (8)$$
Note that at this stage we have looked at one image independently of the
other. Stereo properties have not been considered in constructing the prior
term.
The likelihood term is expressed in terms of the stereo observation:

$$P(O \mid f) = P(I_A \mid f, I_B), \qquad (10)$$

where I_A is the primary image and I_B the secondary (chosen arbitrarily) and f is the hypothesised configuration field. In terms of image sites S (pixels), Eq. 10 becomes:

$$P(O \mid f) \propto \prod_{S} g(i_A, i_B, l_S), \qquad (11)$$
where g() is some symmetric function [7] that describes how well label lS fits
the image evidence iA ∈ IA and iB ∈ IB corresponding to site S. It could
for instance be a Gaussian function of the difference in observed left and
right image intensities at S; we evaluate this instance – Eq. 15 – and propose
alternatives later.
To bias the likelihood term towards hand-like objects, we include a hand cue
term HS, Eq. 12. This term is not required for the system to operate; it merely provides a greater propensity for the MAP ZDF detector to track hand-like scene objects (rather than any arbitrary object), as required by the task. In our tuned implementation, the hand cue term enumerates (assigns a probability to site S in each image) how hand-like a pixel site is in terms of its colour and texture. However, formulation of the hand cue term is beyond the scope of this paper, and to show the generality of this body of work, we have set this term to zero throughout this paper, including the results section. The reader may
formulate this term to best suit their tracking application, or modulate this
term dynamically to intelligently select the tracked/attended object.
$$P(O \mid f) \propto \prod_{S} g(i_A, i_B, l_S, H_S). \qquad (12)$$
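Purely as an illustration of one possible hand cue (recall that HS is set to zero throughout this paper), a crude skin-chrominance likelihood could be used; the Cr/Cb bounds below are assumed values, not those used in any tuned implementation.

import cv2
import numpy as np

def skin_cue(bgr, cr_range=(135, 180), cb_range=(85, 135)):
    """Hypothetical H_S: per-pixel pseudo-probability that a site is skin-coloured,
    from a crude box in YCrCb chrominance space (assumed bounds)."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    inside = ((cr >= cr_range[0]) & (cr <= cr_range[1]) &
              (cb >= cb_range[0]) & (cb <= cb_range[1]))
    cue = np.where(inside, 0.9, 0.1).astype(np.float32)   # soft, not hard, evidence
    return cv2.GaussianBlur(cue, (5, 5), 0)               # spatially smooth the cue

Such a cue would simply modulate g() in Eq. 12 at each site; any colour, texture or shape measure could be substituted.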
We have assembled the terms in Eq. 4 necessary to define the MAP optimisa-
tion problem:
$$P(f \mid O) \propto \exp\Bigl(-\sum_{p}\sum_{q\in N_p} V_{p,q}(f_p, f_q)\Bigr) \cdot \prod_{S} g(i_A, i_B, l_S). \qquad (13)$$
4.1.4 Optimisation
A variety of methods can be used to optimise the above energy function in-
cluding, amongst others, simulated annealing and graph cuts. For active vision,
high-speed performance is a priority. At present, graph cuts are the preferred optimisation technique, validated for this class of optimisation as per [23]. We adopt the method used in [22] for MAP stereo disparity optimisation (we omit their use of α-expansion as we consider a purely binary field). In this formulation, the problem is that of finding the minimum cut on a weighted graph.
The goal is to find the cut with the smallest cost, or equivalently, compute the
maximum flow between terminals according to the Ford-Fulkerson algorithm [12]. The minimum cut yields the configuration that minimises the energy function. Details of the method can be found in [22]. It has been shown to perform (at worst) in low-order polynomial time, but in practice performs in
near linear time for graphs with many short paths between the source and
sink, such as this [23].
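A minimal sketch of the binary min-cut construction, using NetworkX's max-flow/min-cut; the terminal links carry the data costs -ln g(.) and the neighbour links carry the Potts cost 2u. This is the standard construction for a binary field, assumed here, and not necessarily the exact edge weights of [22].

import math
import networkx as nx

def map_zdf_labels(g_z, g_nz, u=0.5):
    """Binary MAP labelling by min-cut over the fovea.
    g_z[y][x], g_nz[y][x]: per-site likelihoods g(..., lz) and g(..., lnz).
    Returns a boolean mask, True = zero disparity (lz)."""
    h, w = len(g_z), len(g_z[0])
    G = nx.DiGraph()
    for y in range(h):
        for x in range(w):
            p = (y, x)
            # source terminal ~ label lz, sink terminal ~ label lnz:
            # cutting src->p pays the cost of assigning lnz, p->snk the cost of lz.
            G.add_edge("src", p, capacity=-math.log(max(g_nz[y][x], 1e-9)))
            G.add_edge(p, "snk", capacity=-math.log(max(g_z[y][x], 1e-9)))
            for q in ((y, x + 1), (y + 1, x)):       # 4-connected neighbours
                if q[0] < h and q[1] < w:
                    G.add_edge(p, q, capacity=2 * u)  # Potts smoothness cost (Eq. 8)
                    G.add_edge(q, p, capacity=2 * u)
    cut_value, (src_side, _) = nx.minimum_cut(G, "src", "snk")
    return [[(y, x) in src_side for x in range(w)] for y in range(h)]

The cut value equals the minimised energy E(f); sites left on the source side of the cut take the label lz.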
4.1.5 Robustness
We now look at the situations where the MAP ZDF formulation performs
poorly, and provide methods to combat these weaknesses. Fig. 7a shows ZDF
output for typical input images where the likelihood term has been defined
using intensity comparison. Output was obtained at approximately 27 fps for the 60×60 pixel fovea on a standard 3 GHz single-processor PC. For this case,
g() in Eq. 11 has been defined as:
$$g(i_A, i_B, f) = \begin{cases} e^{-(\Delta I_C)^2/2\sigma^2} & \forall f = l_z,\\ 1 - e^{-(\Delta I_C)^2/2\sigma^2} & \forall f = l_{nz}. \end{cases} \qquad (15)$$
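As a sketch, the Gaussian comparator of Eq. 15 (with ΔI_C taken here as the left/right intensity difference at the site, and σ an assumed tuning constant) is simply:

import math

def g_intensity(i_left, i_right, label, sigma=10.0):
    """Eq. 15: likelihood of the intensity evidence at one site under label
    'lz' (zero disparity) or 'lnz'; sigma is an assumed tuning parameter."""
    d = float(i_left) - float(i_right)            # Delta I_C at this site
    p_z = math.exp(-(d * d) / (2.0 * sigma * sigma))
    return p_z if label == "lz" else 1.0 - p_z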
To combat the thresholding problem with the NCC approach, the images can
be pre-processed with a DOG kernel. The output using this technique (Fig. 7c)
is good, but is much slower than all previous methods (∼8 fps) and requires
yet more tuning at the DOG stage. It is still susceptible to the problem of
non-symmetric output.
Fig. 4. NDT descriptor construction, four comparisons.
The NDT comparator is instead robust to gain or contrast variations between image pairs. Fig. 4 depicts the definition of the NDT transform.
In this approach, we assign a boolean descriptor string to each site and then
compare the descriptors. The descriptor is assembled by comparing pixel in-
tensity relations in the 3×3 neighbourhood around each site (Fig. 4). In its
simplest form, for example, we first compare the central pixel at a site in the
primary image to one of its four-connected neighbours, assigning a 1 to the
descriptor string if the pixel intensity at the centre is greater than that of its
northern neighbour and a 0 otherwise. This is done for its southern, eastern
and western neighbours also. This is repeated at the same pixel site in the
secondary image. The order of construction of all descriptors is necessarily
the same. A more complicated descriptor would be constructed using more
than merely four relations 5 . Comparison of the descriptors for a particular
site is trivial, the result being equal to the sum of entries in the primary image
site descriptor that match the descriptor entries at the same positions in the
string for the secondary image site descriptor, divided by the length of the
descriptor string.
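A minimal sketch of the four-relation descriptor and its comparison as described above (Python/NumPy, assuming equally sized grayscale left and right foveas):

import numpy as np

def ndt_descriptors(img):
    """4-bit NDT descriptor per interior pixel: centre > {N, S, E, W} neighbour,
    assembled in the same order for every site."""
    c = img[1:-1, 1:-1]
    return np.stack([c > img[:-2, 1:-1],    # north
                     c > img[2:, 1:-1],     # south
                     c > img[1:-1, 2:],     # east
                     c > img[1:-1, :-2]],   # west
                    axis=-1)

def ndt_match(left, right):
    """Fraction of matching descriptor bits at each site (1.0 = perfect match)."""
    dl, dr = ndt_descriptors(left), ndt_descriptors(right)
    return (dl == dr).mean(axis=-1).astype(np.float32)

Because only intensity orderings are compared, the result is unchanged by per-camera gain or contrast differences, which is the behaviour reported in Fig. 7e-f.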
Fig. 7d shows NDT output for typical images. Assignment and comparison
of descriptors is faster than NCC DOG (∼27 fps), yet requires no parameter tuning. In Fig. 7e, the left camera gain was maximised, and the right camera contrast was maximised. In Fig. 7f, the left camera was defocussed and saturated. The segmentation retained its good performance under these artificial
extremes.
5 Experiments have shown that a four-neighbour comparator gives results that compare favorably (in terms of trade-offs between performance and processing time) to
more complicated descriptors.
Fig. 5. Histograms of individual NCC DOG (left) and NDT (right) neighborhood
comparisons for a series of observations.
MAP ZDF Tracking Algorithm:
(1) Locate the hand coarsely by normalised cross-correlation (NCC) of a
template taken from the previous segmentation over the left and right
foveas, giving the horizontal disparity d of best correlation.
(2) Perform a virtual shift of the left fovea by d/2 and the right fovea
by −d/2 to approximately align the location of best correlation in
the virtual centre of the left and right foveas. If the NCC result
was not sufficiently high, no physical shift will be conducted and
the process returns to the first step.
(3) MAP ZDF segmentation extracts the zero disparity pixels associ-
ated with a 2D projection of the hand from the virtually aligned
foveas. If there is indeed a hand at the virtual fixation point, the
area of the segmented region will be significantly beyond zero.
(4) If the area is greater than a minimum threshold, the virtual shift
has aligned the centre of the images over the hand. In this case,
a physical movement of the cameras is executed that reduces the
virtual shift to zero pixels, and aligns the centres of the cameras
with the centre of gravity of the segmented area. If the area is below
the threshold, there is little likelihood that a hand or object is at
the virtual fixation point, and no physical shifting is justified.
Fig. 6. Hand tracking algorithm using maximum a posteriori probability zero disparity filter (MAP ZDF) segmentation.
The previously segmented hand region is selected as the template, and the method does not depend on the centre of the hand being aligned in the template.
6 Results
Hand tracking and segmentation for the purpose of real-time HCI gesture
recognition and classification must exhibit robustness to arbitrary lighting
variations over time and between the cameras, poorly focussed cameras, hand
orientation, hand velocity, varying backgrounds, foreground and background
Fig. 7. MAP ZDF hand segmentation. The left and right images and their respective
foveas are shown with ZDF output (bottom right) for each case a-f. Result a involves
intensity comparison, b involves NCC, and c DOG NCC for typical image pairs.
Results d-f show superior NDT output for typical images d, and extreme adverse
conditions e,f.
distractors including non-tracked hands and skin regions, and hand appear-
ance such as skin or hand covering colour. System performance must also be
adequate to allow natural hand motion in HCI observations. The quality of
the segmentation must be sufficient that it does not depart from the hand over
time. Ideally the method should find the hand in its entirety in every frame,
and segment adequately for gesture recognition. For recognition, segmentation
need not necessarily be perfect for every frame because if track is maintained,
real-time classification is still possible based on classification results that are
validated over several frames. Frames that are segmented with some error still
usually provide useful segmentation information to the classifier.
Fig. 7 shows snapshots from online MAP ZDF hand segmentation sequences.
Segmentations on the right (d-f) show robust performance of the NDT com-
parator under extreme lighting, contrast and focus conditions. Fig. 8 shows the
robust performance of the system in difficult situations including foreground
and background distractors. As desired, segmentation of the tracked hand
continues. Fig. 9 shows a variety of hand segmentations under typical circum-
stances including reconfiguring, rotating and moving hands as they perform a
sequence of conceptually symbolic gestures in real time.
Fig. 8. Robust performance in difficult situations: Segmentation of the tracked hand
from a face in the near background (top left); from a second distracting hand in the
background (bottom left); and from a distracting occluding hand in the immediate
foreground, a distance of 3 cm from the tracked hand at a distance of 2 m from the cameras (top right). Once the hands are closer together than 3 cm, they are
segmented as the same object (bottom right).
7 Performance
7.1 Speed
On average, the system is able to fixate and track hands at 27 fps, including display. Acquiring the initial segmentation takes a little longer (23-25 fps
for the first few frames) after which successive MAP ZDF optimisation results
do not vary significantly so using the previous segmentation as an initiali-
sation for the current frame accelerates MRF labeling. Similarly, the change
in segmentation area between consecutive frames at 30 fps is typically small,
allowing sustained high frame rates after initial segmentation. The frame rate
remains above 20 fps and is normally up to the full 30 fps camera frame rate.
7.2 Quality
The approach compares favorably to other ZDF approaches that have not
incorporated MRF contextual refinement, allowing relaxation and refinement
of the zero disparity assumption such that surfaces that are not perpendicular
to the camera axis can be segmented.
The right side images in Fig. 8 show the case where a tracked hand is oc-
cluded by an incoming distractor hand. The hands are located approximately
2 m from the cameras in this example. Reliable segmentation of the tracked hand (behind) from the occluding distractor hand (in front) remains until the distractor hand is a distance of approximately 3 cm from the tracked hand. Closer than this, the hands are segmented as a connected object, which is
conceptually valid.
The hand can be tracked as long as it does not move entirely out of the
fovea between consecutive frames. This is because no predictive tracking is
incorporated (such as a Kalman filter). In practice, we find that the hand
must be moved unnaturally quickly to escape track, such that it leaves the
fovea completely between consecutive frames. Tracking a target as it moves in
the depth direction (towards or away from the cameras) is sufficiently rapid
that loss of track does not occur. In interacting with the system, we found that track was not lost for natural hand motions (see demonstration footage,
Section 9).
The visual workspace for the system remains within a conic whose arc angle is around 100°. Performance remains good to a workspace depth (along the camera axis) of 5 m, for the resolution, baseline, and zoom settings of our stereo
apparatus. Higher resolution or more camera zoom would increase disparity
sensitivity, permitting zero disparity filtering at larger scene depths.
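As a rough sketch of this sensitivity (assuming, for simplicity, near-parallel camera axes with focal length f in pixels, baseline b and scene depth Z; these symbols are introduced here purely for illustration), disparity falls off with depth as

$$d = \frac{f\,b}{Z}, \qquad \frac{\partial d}{\partial Z} = -\frac{f\,b}{Z^{2}},$$

so at large Z a given depth offset produces only a fraction of a pixel of disparity; increasing resolution or zoom effectively increases f and restores the sensitivity needed to separate the hand from its background.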
This work can provide a base segmentation to facilitate gesture validation. Fig. 9 shows various segmentations for conceivable symbolic gestures. Segmentation quality is such that the hand is extracted from its surroundings, which has significant benefits in classification processes because the operation is not tainted by background features. To place greater restriction on the segmentation of hand-like regions, an intuitive step would be to combine MAP ZDF segmentation with other cues to ensure “hand-ness” of the subject. Ap-
pearance classification or model verification could also be used. The frame-
work, however, provides the means to incorporate probabilistic hand-ness of
the segmentation. By inserting knowledge of the hand into the prior term in
the ZDF formulation (for example a skin colour cue or shape/size cue), a mea-
sure of hand-ness could be incorporated into the segmentation process itself.
In this instance, reliance on a final verification step is reduced or eliminated.
The likelihood term described in the MAP ZDF formulation does not incor-
porate the hand-ness term so that we are able to accurately segment the hand
and any hand-held object. The last two examples in Fig. 9 show the segmen-
tation of a hand holding a set of keys, and a hand holding a stapler. The term
has also been excluded for performance comparison with other ZDF tracking
filters that do not incorporate biasing for task-dependent tracking of specific
features such as hands (Section 7.4.2).

Fig. 10. Comparison to other methods: example output. Images reproduced from: a) Shen (Mean Shift) [37], b) Shen (Annealed Mean Shift) [37], c) Comaniciu (Cam Shift) [11], d) Allen (Cam Shift) [1].
Fig. 11 shows sample ZDF output from existing methods for comparison. These
methods provide probability distribution and bounding box outputs. The un-
derlying probability maps may be suitable for MRF refinement such as ours,
but they do not inherently provide segmentation.
Fig. 11. ZDF performance comparison. Images reproduced from: a) Oshiro [31], b)
Rae [32], c) Rougeaux [34], d) Yu [43], e) Rougeaux [36], f) This paper.
Fig. 3 shows sample ZDF output from our system without the incorporation
of MRF contextual refinement. Fig. 7c shows output using the same algorithm
as in Fig. 3, but incorporates MAP MRF contextual refinement from the orig-
inal images. Any attempt to use the output in Fig. 3 alone for segmentation
(via any, perhaps complex, method of thresholding), or for tracking, would not
yield results comparable to those achievable by using the output in Fig. 7c.
The underlying non-MRF processes may or may not produce ZDF probability
maps comparable to those produced by others (Section 7.4.2). However, the
tracking quality achievable by incorporating MRF contextual image informa-
tion refinement is better than is possible with the underlying ZDF process alone.
8 Discussion
It is critical that the MAP ZDF refinement operates at or near frame rate.
This is because we consider only the 60×60 pixel fovea when extracting the
zero disparity region. At slower frame rates, a subject could more easily escape
the fovea, resulting in loss of track. Increasing the fovea size could help prevent
this occurring, but would have the consequence of increasing processing time
per frame.
Our method uses all image information; it does not match only edges, features or blobs extracted from single or multiple cues. The strongest labeling
evidence does indeed come from textured and feature rich regions of the im-
age, but the Markov assumption propagates strongly labeled pixels through
pixel neighbourhoods that are visually similar until edges or transitions in
the images are reached. The framework deals with the trade-off between edge
strengths and neighbourhood similarity in the MAP formulation.
In contrast to many motion-based methods, where motion models are used to estimate target location based on previous trajectories (e.g., Kalman filtering), the implementation does not rely upon complex spa-
tiotemporal models to track objects. It merely conducts a continual search for
the maximal area of ZDF output, in the vicinity of the previous successful
segmentation. The segmentations can subsequently be used for spatial local-
isation of the tracked object, but spatiotemporal dynamics do not form part
of the tracking mechanism.
The focus of this paper has been on hand tracking, but this is just one example
of the general usefulness of robust zero disparity filtering.
9 Conclusion
A MAP ZDF has been formulated and used to segment and track an arbitrarily
moving, rotating and re-configuring hand, performing accurate marker-less
pixel-wise segmentation of the hand. A large visual workspace is achieved by
the use of active vision. Hand extraction is robust to lighting changes, defocus,
hand colour, foreground and background clutter including non-tracked hands,
and partial or gross occlusions including those by non-tracked hands. Good
system performance is achieved in the context of HCI systems. It operates at
approximately 27 fps on a 3 GHz single-processor PC.
Demonstration Footage
http://rsise.anu.edu.au/~andrew/cviu05
References
[1] J G Allen, R Y D Xu, and J S Jin. Object tracking using camshift algorithm
and multiple quantized feature spaces. In Conf. in Research and Practice in
Inf. Tech., 2003.
[7] Y Boykov, O Veksler, and R Zabih. Markov random fields with efficient
approximations. Technical Report TR97-1658, Computer Science Department,
Cornell University, Ithaca, NY 14853, March 1997.
[8] G R Bradski. Computer vision face tracking for use in a perceptual user
interface. Intel Technology Journal, 1998.
[9] T. Cham and J. Rehg. Dynamic feature ordering for efficient registration. In
IEEE International Conference on Computer Vision, volume 2, Corfu, Greece,
1999.
[10] Y Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.
[12] L Ford and D Fulkerson. Flows in Networks. Princeton University Press, 1962.
[15] D.M. Gavrila and L.S. Davis. 3-d model-based tracking of human motion in
action. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 1996.
[16] S Geman and D Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
[17] K. Imagawa, S. Lu, and S. Igi. Color-based hands tracking system for sign
language recognition. In IEEE International Conference on Face and Gesture
Recognition, Nara, Japan, 1998.
[18] M Isard and A Blake. Condensation: conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
[22] V Kolmogorov and R Zabih. Multi-camera scene reconstruction via graph cuts.
In European Conference on Computer Vision, pages 82–96, 2002.
[23] V Kolmogorov and R Zabih. What energy functions can be minimized via graph
cuts? In European Conference on Computer Vision, pages 65–81, 2002.
[24] Gareth Loy, Luke Fletcher, Nicholas Apostoloff, and Alexander Zelinsky. An
adaptive fusion architecture for target tracking. In Fifth IEEE International
Conference on Automatic Face and Gesture Recognition, page 261, 2002.
[29] R O’Hagan, S. Rougeaux, and A. Zelinsky. Visual gesture interfaces for virtual
environments. Interacting with Computers, 14:231–250, 2002.
[30] E. Ong and S. Gong. A dynamic human model using hybrid 2d-3d
representations in hierarchical pca space. In British Machine Vision Conference,
volume 1, pages 33-42, Nottingham, UK, BMVA, 1999.
[31] N Oshiro, N Maru, A Nishikawa, and F Miyazaki. Binocular tracking using log
polar mapping. In IROS, pages 791–798, 1996.
[33] C. Rasmussen and G. Hager. Joint probabilistic techniques for tracking multi-
part objects. In IEEE Conference on Computer Vision and Pattern Recognition,
pages 16-21, Santa Barbara, CA, 1998.
[36] S Rougeaux and Y Kuniyoshi. Velocity and disparity cues for robust real-time
binocular tracking. In IEEE International CVPR, 1997.
[37] C Shen, M J Brooks, and A Hengel. Fast global kernel density mode seeking
with application to localisation and tracking. In IEEE Int. Conf. on Comp.
Vis, 2005.