Zivkovic PRL 2006
a Faculty of Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
b University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
Received 5 July 2004; received in revised form 17 August 2005
Abstract
We analyze the computer vision task of pixel-level background subtraction. We present recursive equations that are used to constantly
update the parameters of a Gaussian mixture model and to simultaneously select the appropriate number of components for each pixel.
We also present a simple non-parametric adaptive density estimation method. The two methods are compared with each other and with
some previously proposed algorithms.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Background subtraction; On-line density estimation; Gaussian mixture model; Non-parametric density estimation
1. Introduction
A static camera observing a scene is a common case of
a surveillance system. Detecting intruding objects is an
essential step in analyzing the scene. A usually applicable assumption is that the images of the scene without the intruding objects exhibit some regular behavior that can be well described by a statistical model. If we have a statistical model of the scene, an intruding object can be detected by spotting the parts of the image that do not fit the model.
This process is usually known as background subtraction.
In the case of common pixel-level background subtraction the scene model has a probability density function
for each pixel separately. A pixel from a new image is considered to be a background pixel if its new value is well
described by its density function. For a static scene the simplest model could be just an image of the scene without the
Corresponding author. Tel.: +31 20 525 7564; fax: +31 20 525 7490.
E-mail address: [email protected] (Z. Zivkovic).
0167-8655/$ - see front matter 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.patrec.2005.11.005
ARTICLE IN PRESS
2
Z. Zivkovic, F. van der Heijden / Pattern Recognition Letters xxx (2006) xxxxxx
selection criterion to choose the right number of components for each pixel on-line and in this way automatically
fully adapt to the scene.
The non-parametric density estimates also lead to flexible models. The kernel density estimate was proposed for background subtraction in (Elgammal et al., 2000). A problem with the kernel estimates is the choice of the fixed kernel size. This problem can be addressed using variable-size kernels (Wand and Jones, 1995). Two simple approaches are: the balloon estimator adapts the kernel size at each estimation point; and the sample-point estimator adapts the kernel size for each data point. In (Mittal and Paragios, 2004) an elaborate hybrid scheme is used.
As the second contribution of the paper, we use here the balloon variable-size kernel approach. We use uniform kernels for simplicity. The balloon approach leads to a very efficient implementation that is equivalent to using a fixed uniform kernel (see Section 4). Finally, as the third contribution, we analyze and compare the standard algorithms (Stauffer and Grimson, 1999; Elgammal et al., 2000) and the newly proposed algorithms.
The paper is organized as follows. In the next section, we state the problem of pixel-based background subtraction. In Section 3, we review the GMM approach from Stauffer and Grimson (1999) and present how the number of components can be selected on-line to improve the algorithm. In Section 4, we review the non-parametric kernel-based approach from Elgammal et al. (2000) and propose a simplification that leads to better experimental results. In Section 5, we give the experimental results and analyze them.

2. Problem definition
The value of a pixel at time t in RGB is denoted by $\vec{x}^{(t)}$. Some other color space or some local features could also be used. For example, in (Mittal and Paragios, 2004) normalized colors and optical flow estimates were used. Pixel-based background subtraction involves deciding whether the pixel belongs to the background (BG) or some foreground object (FG). The pixel is more likely to belong to the background if

$$\frac{p(\mathrm{BG}\mid \vec{x}^{(t)})}{p(\mathrm{FG}\mid \vec{x}^{(t)})} = \frac{p(\vec{x}^{(t)}\mid \mathrm{BG})\,p(\mathrm{BG})}{p(\vec{x}^{(t)}\mid \mathrm{FG})\,p(\mathrm{FG})}$$

is larger than 1, and vice versa. The results from the background subtraction are usually propagated to some higher
level modules, for example, the detected objects are often
tracked. While tracking an object we could obtain some
knowledge about the appearance of the tracked object
and this knowledge could be used to improve the background subtraction. This is discussed, for example, in (Harville, 2002; Withagen et al., 2002). In the general case we do
not know anything about the foreground objects that can
be seen nor when and how often they will be present.
Therefore we assume a uniform distribution for the appearance of the foreground objects. The GMM with M components is

$$\hat{p}(\vec{x}\mid X_T, \mathrm{BG}+\mathrm{FG}) = \sum_{m=1}^{M} \hat{\pi}_m\, \mathcal{N}\!\left(\vec{x};\, \hat{\vec{\mu}}_m,\, \hat{\sigma}_m^2 I\right),$$
where $\hat{\vec{\mu}}_1, \ldots, \hat{\vec{\mu}}_M$ are the estimates of the means and $\hat{\sigma}_1^2, \ldots, \hat{\sigma}_M^2$ are the estimates of the variances that describe the Gaussian components. For computational reasons the covariance matrices are kept isotropic; the identity matrix I has proper dimensions. The estimated mixing weights, denoted by $\hat{\pi}_m$, are non-negative and add up to one.
3.1. Update equations
Given a new data sample $\vec{x}^{(t)}$ at time t, the recursive update equations are (Titterington, 1984):

$$\hat{\pi}_m \leftarrow \hat{\pi}_m + \alpha\left(o_m^{(t)} - \hat{\pi}_m\right), \tag{4}$$
$$\hat{\vec{\mu}}_m \leftarrow \hat{\vec{\mu}}_m + o_m^{(t)}\left(\alpha/\hat{\pi}_m\right)\vec{\delta}_m, \tag{5}$$
$$\hat{\sigma}_m^2 \leftarrow \hat{\sigma}_m^2 + o_m^{(t)}\left(\alpha/\hat{\pi}_m\right)\left(\vec{\delta}_m^{\,T}\vec{\delta}_m - \hat{\sigma}_m^2\right), \tag{6}$$

where $\vec{\delta}_m = \vec{x}^{(t)} - \hat{\vec{\mu}}_m$. Instead of the time interval T that was mentioned above, here the constant α defines an exponentially decaying envelope that is used to limit the influence of the old data. We keep the same notation, having in mind that effectively α = 1/T. For a new sample the ownership $o_m^{(t)}$ is set to 1 for the 'close' component with largest $\hat{\pi}_m$ and the others are set to zero. We define that a sample is 'close' to a component if the Mahalanobis distance from the component is, for example, less than three. The squared distance from the mth component is calculated as $D_m^2 = \vec{\delta}_m^{\,T}\vec{\delta}_m/\hat{\sigma}_m^2$. If there are no 'close' components, a new component is generated with $\hat{\pi}_{M+1} = \alpha$, $\hat{\vec{\mu}}_{M+1} = \vec{x}^{(t)}$ and $\hat{\sigma}_{M+1} = \sigma_0$, where $\sigma_0$ is some appropriate initial variance. If the maximum number of components is reached, we discard the component with smallest $\hat{\pi}_m$.
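The update cycle above can be sketched in code. This is a minimal illustration of Eqs. (4)–(6) together with the ownership and close-component rules as described; the list-of-dicts component store, the parameter values, and the final renormalization of the weights are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

ALPHA = 0.001   # alpha = 1/T, exponential decay factor (assumed value)
SIGMA0 = 15.0   # initial variance sigma_0 for a new component (assumed value)
M_MAX = 4       # maximum number of components per pixel

def update_pixel(components, x):
    """components: list of dicts with keys 'pi', 'mu', 'var'; x: pixel sample."""
    # Find the 'close' component (Mahalanobis distance < 3) with largest weight.
    owner = None
    for c in sorted(components, key=lambda c: -c['pi']):
        d = x - c['mu']
        if np.dot(d, d) / c['var'] < 3.0 ** 2:
            owner = c
            break
    for c in components:
        o = 1.0 if c is owner else 0.0                    # ownership o_m
        c['pi'] += ALPHA * (o - c['pi'])                  # eq. (4)
        if o:
            d = x - c['mu']
            c['mu'] = c['mu'] + (ALPHA / c['pi']) * d     # eq. (5)
            c['var'] += (ALPHA / c['pi']) * (np.dot(d, d) - c['var'])  # eq. (6)
    if owner is None:                                     # no close component
        if len(components) >= M_MAX:                      # discard smallest weight
            components.remove(min(components, key=lambda c: c['pi']))
        components.append({'pi': ALPHA, 'mu': x.astype(float), 'var': SIGMA0})
    # Keep the mixing weights adding up to one.
    s = sum(c['pi'] for c in components)
    for c in components:
        c['pi'] /= s
    return components
```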
The presented algorithm is an on-line clustering algorithm. Usually, the intruding foreground objects will be represented by some additional clusters with small weights $\hat{\pi}_m$. Therefore, we can approximate the background model by the first B largest clusters:

$$\hat{p}(\vec{x}\mid X_T, \mathrm{BG}) \sim \sum_{m=1}^{B} \hat{\pi}_m\, \mathcal{N}\!\left(\vec{x};\, \hat{\vec{\mu}}_m,\, \hat{\sigma}_m^2 I\right). \tag{8}$$
The mixing weight of the mth component can be estimated from t samples as

$$\hat{\pi}_m = \frac{n_m}{t} = \frac{1}{t}\sum_{i=1}^{t} o_m^{(i)}. \tag{9}$$

The estimate from t samples is denoted as $\hat{\pi}_m^{(t)}$ and it can be rewritten in a recursive form as a function of the estimate $\hat{\pi}_m^{(t-1)}$ for $t-1$ samples and the ownership $o_m^{(t)}$ of the last sample:

$$\hat{\pi}_m^{(t)} = \hat{\pi}_m^{(t-1)} + \frac{1}{t}\left(o_m^{(t)} - \hat{\pi}_m^{(t-1)}\right). \tag{10}$$
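As a quick sanity check, the recursive form (10) can be verified numerically against the batch average (9): feeding a stream of 0/1 ownerships through (10) yields their cumulative mean. The function name below is an illustrative choice.

```python
def recursive_weight(ownerships):
    """Apply eq. (10) over a stream of 0/1 ownerships; return the final weight."""
    pi = 0.0
    for t, o in enumerate(ownerships, start=1):
        pi = pi + (1.0 / t) * (o - pi)   # eq. (10)
    return pi

# The result matches the batch estimate (9), sum(ownerships)/len(ownerships).
ownerships = [1, 0, 1, 1, 0, 1]
print(recursive_weight(ownerships))
```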
where $K = \sum_{m=1}^{M}\sum_{i=1}^{t} o_m^{(i)} = t$. Eq. (11) can be rewritten as

$$\hat{\pi}_m = \frac{\hat{\Pi}_m - c/t}{1 - Mc/t}, \tag{12}$$

where $\hat{\Pi}_m = \frac{1}{t}\sum_{i=1}^{t} o_m^{(i)}$ is the ML estimate from (9) and the bias from the prior is introduced through c/t. The bias decreases for larger data sets (larger t). However, if a small bias is acceptable we can keep it constant by fixing c/t to $c_T = c/T$ with some large T. This means that the bias will always be the same as it would have been for a data set with T samples. It is easy to show that the recursive version of (11) with fixed $c/t = c_T$ is given by

$$\hat{\pi}_m^{(t)} = \hat{\pi}_m^{(t-1)} + \frac{1}{t}\left(\frac{o_m^{(t)}}{1 - Mc_T} - \hat{\pi}_m^{(t-1)}\right) - \frac{1}{t}\,\frac{c_T}{1 - Mc_T}. \tag{13}$$

With the exponentially decaying envelope α = 1/T, this leads to the modified version of the weight update (4):

$$\hat{\pi}_m \leftarrow \hat{\pi}_m + \alpha\left(o_m^{(t)} - \hat{\pi}_m\right) - \alpha c_T. \tag{14}$$
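The modified weight update (14) can be sketched as follows. Pruning components whose weight becomes non-positive, and the chosen constants, are illustrative assumptions that are not spelled out in this excerpt.

```python
import numpy as np

ALPHA = 0.001   # alpha = 1/T (assumed value)
C_T = 0.01      # c_T = c/T, small constant prior bias (assumed value)

def update_weights(pis, owner_idx):
    """pis: array of mixing weights; owner_idx: index of the owning component."""
    o = np.zeros_like(pis)
    o[owner_idx] = 1.0
    pis = pis + ALPHA * (o - pis) - ALPHA * C_T   # eq. (14)
    keep = pis > 0.0                              # prune dead components (assumed rule)
    pis = pis[keep]
    return pis / pis.sum(), keep                  # weights add up to one again
```

The negative term −αc_T slowly suppresses components that receive no ownership, so rarely supported components eventually disappear from the mixture.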
$$\hat{p}\!\left(\vec{x}\mid X_T\right) = \frac{1}{TV}\sum_{m=t-T}^{t} K\!\left(\frac{\vec{x} - \vec{x}^{(m)}}{D}\right), \tag{15}$$

where the kernel function K(u) = 1 if $\|u\| < 1/2$ and 0 otherwise. The volume V of the kernel is proportional to $D^d$, where d is the dimensionality of the data. Other, smoother kernel functions K are often used. For example, a Gaussian profile is used in (Elgammal et al., 2000). In practice the kernel form K has little influence, but the choice of D is critical (Wand and Jones, 1995). In (Elgammal et al., 2000) the median med is calculated for the absolute differences $\|\vec{x}^{(t)} - \vec{x}^{(t-1)}\|$ of the samples from $X_T$ and a simple robust estimate of the standard deviation is used: $D = \mathrm{med}/(0.68\sqrt{2})$.
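A direct reading of (15) with the uniform kernel might look like the following sketch; the function name is an assumption, and the proportionality constant in V ∝ D^d is omitted (V is taken as exactly D^d).

```python
import numpy as np

def kernel_density(x, samples, D):
    """Uniform-kernel estimate (15) of p(x) from the T stored samples."""
    samples = np.asarray(samples, dtype=float)
    T, d = samples.shape
    V = D ** d                                   # kernel volume, taken as D^d
    dist = np.linalg.norm(samples - x, axis=1)
    inside = np.count_nonzero(dist / D < 0.5)    # K(u) = 1 for ||u|| < 1/2
    return inside / (T * V)
```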
4.2. Simple balloon variable kernel density estimation
The kernel estimate uses one fixed kernel size D for the whole density function, which might not be the best choice (Wand and Jones, 1995). The so-called balloon estimator adapts the kernel size at each estimation point $\vec{x}$. Instead of trying to find the globally optimal D, we could increase the width D of the kernel for each new point $\vec{x}$ until a fixed amount of data k is covered. In this way we get large kernels in the areas with a small number of samples and smaller kernels in the densely populated areas. This estimate is not a proper density estimate since the integral of the estimate is not equal to 1. There are many other, more elaborate approaches (Hall et al., 1995). Still, the balloon estimate is often used for classification problems since it is related to the k-NN classification (see Bishop, 1995, p. 56). One nearest neighbor is common, but to be more robust to outliers we use k = [0.1T], where [·] is the round-to-integer operator.
The balloon approach leads to an efficient implementation that is equivalent to using a fixed uniform kernel. Only the choice for the threshold $c_{\mathrm{thr}}$ from (2) is different. For both the fixed kernel and the balloon estimate, the decision that a new sample $\vec{x}$ fits the model is made if there are more than k points within the volume V (15). The kernel-based approach has V fixed, and k is the variable parameter that can be used as the threshold $c_{\mathrm{thr}} = k$ from (2). For the uniform kernel k is discrete and we get discontinuous estimates. The balloon variable kernel approach in this paper has k fixed, and the volume V is the variable parameter: $c_{\mathrm{thr}} \sim 1/V \sim 1/D^d$. The problems with the discontinuities do not occur. An additional advantage is that we do not estimate the sensitive kernel size parameter as in (Elgammal et al., 2000).
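Under these assumptions, the balloon decision rule can be sketched as follows: grow the kernel width around the new point until k = [0.1T] samples are covered, then threshold on 1/V. The function name, the guard against zero distances, and the exact thresholding form are illustrative choices, not taken from the paper.

```python
import numpy as np

def balloon_fits_background(x, samples, c_thr):
    """True if x fits the model: kernel volume covering k samples is small enough."""
    samples = np.asarray(samples, dtype=float)
    T, d = samples.shape
    k = max(1, round(0.1 * T))                        # k = [0.1 T]
    dist = np.sort(np.linalg.norm(samples - x, axis=1))
    D = max(2.0 * dist[k - 1], 1e-9)                  # width covering k samples
    V = D ** d                                        # volume taken as D^d
    return bool(1.0 / V > c_thr)                      # threshold c_thr ~ 1/V
```

Because k is fixed and V varies continuously with the data, the discontinuity problem of the discrete count k in the fixed-kernel decision does not arise.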
4.3. Practical issues
In practice T is large, and keeping all the samples in $X_T$ would require too much memory and calculating (15) would be too slow. It is reasonable to choose a fixed number of samples $K \ll T$ and randomly select a sample from each subinterval T/K. This might give too sparse a sampling of the interval T. In (Elgammal et al., 2000) the model is split into a short-term model that has $K_{\mathrm{short}}$ samples from a $T_{\mathrm{short}}$ period and a long-term model with $K_{\mathrm{long}}$ samples from $T_{\mathrm{long}}$. The short-term model contains a denser sampling of the recent history. We use a similar short-term/long-term strategy as in (Elgammal et al., 2000).
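The subsampling scheme described above, one random sample from each of the K subintervals of length T/K, might be sketched as follows; the buffer layout and function name are assumptions.

```python
import random

def subsample_history(history, K):
    """history: list of T samples ordered in time; return K representatives,
    one drawn at random from each subinterval of length T/K."""
    T = len(history)
    step = T / K
    picks = []
    for j in range(K):
        lo, hi = int(j * step), int((j + 1) * step)
        picks.append(random.choice(history[lo:max(hi, lo + 1)]))
    return picks
```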
The plant from the scene was swaying because of the wind.
This sequence is taken by a low-quality web-camera. The
highly dynamic sequence 'Trees' is taken from (Elgammal et al., 2000). This sequence has 857 frames. We will analyze only the steady-state performance and the performance with slow gradual changes. Therefore, the first 500 frames of the sequences were not used for evaluation, and the rest of the frames were manually segmented to generate the ground truth. Some experiments considering adaptation to sudden changes and the initialization problems can be found in (Toyama et al., 1999, 2001, 2005). For both algorithms and for different threshold values ($c_{\mathrm{thr}}$ from (2)), we measured the true positives (the percentage of the pixels that belong to the intruding objects that are correctly assigned to the foreground) and the false positives (the percentage of the background pixels that are incorrectly classified as the foreground). These results are plotted as the receiver operating characteristic (ROC) curves (Egan, 1975) that are used for evaluation and comparison (Zhang, 1996). For both algorithms, we use α = 0.001.
5.1. Improved GMM
We compare the improved GMM algorithm with the original algorithm (Stauffer and Grimson, 1999) with a fixed number of components M = 4. In Fig. 1, we demonstrate the improvement in the segmentation results (the ROC curves) and in the processing time. The reported processing time is for 320 × 240 images and measured on a 2 GHz PC. In the second column of Fig. 1, we also illustrate how the new algorithm adapts to the scene. The gray values in the images indicate the selected number of components per pixel. Black stands for one Gaussian per pixel, and a pixel is white if the maximum of 4 components is used. For example, the scene from the 'Lab' sequence has a monitor with rolling interference bars and the waving plant. We see that the dynamic areas are modelled using more components. Consequently, the processing time also depends on the complexity of the scene. For the highly dynamic 'Trees' sequence the processing time is close to that of the original algorithm (Stauffer and Grimson, 1999). Intruding objects trigger the generation of new components that are removed after some time (see the 'Traffic' sequence). This also influences the processing speed. For simple scenes like the 'Traffic', often a single Gaussian
Table 1
A brief summary of the GMM and the non-parametric background subtraction algorithms

General steps (per new sample at time t) | GMM | Non-parametric
Classify the new sample | Use (7) | Use (16)
Update the model | Use (14), (5) and (6); see Section 3 for some practical issues | Add the new sample to X_T and remove the oldest one; see Section 4.3 for some practical issues
Determine the background model | Use (8) to select the components of the GMM that belong to the background | If (15) > c_thr use the new sample for p̂(x⃗|X_T, BG) (set b_m = 1 for the sample)
Fig. 1. Comparison of the newly proposed methods to the previous methods. The ROC curves are presented for the GMMs and the non-parametric (NP) models. For the new GMM model we also present the selected number of mixture components using the new algorithm. We also report the average processing times in the second column of the table.
5.3. Comparison
In order to better understand the performance of the
algorithms we show the estimated decision boundary for
the background models for a pixel in Fig. 2a. The pixel
comes from the image area where there was a plant waving
because of the wind. This leads to a complex distribution.
The GMM tries to cover the data with two isotropic Gaussians. The non-parametric model is more flexible and captures the presented complex distribution more closely. Therefore the non-parametric method usually outperforms the GMM method in complex situations, as we can clearly observe in Fig. 2b where we compare the ROC curves of the two new algorithms. However, for a simple scene such as the 'Traffic' scene the GMM is also a good model. An advantage of the new GMM is that it gives a compact model which might be useful for some further postprocessing.
Fig. 2. Comparison of the new GMM algorithm and the new non-parametric (NP) method: (a) an illustration of how the models fit the data. The estimated models are presented for a certain threshold for frame 840 of the 'Laboratory' sequence and for the pixel (283, 53) (the pixel is in the area of the waving plant above the monitor, see Fig. 1); (b) ROC curves for comparison; (c) the convex-hull surfaces (Pareto front) that represent the best possible performance of the algorithms for different parameter choices.
Hall, P., Hui, T.C., Marron, J.S., 1995. Improved variable window kernel estimates of probability densities. Ann. Statist. 23 (1), 1–10.
Harville, M., 2002. A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models. In: Proc. of the European Conf. on Computer Vision.
Hayman, E., Eklundh, J.-O., 2003. Statistical background subtraction for a mobile observer. In: Proc. of the Internat. Conf. on Computer Vision. pp. 67–74.
KaewTraKulPong, P., Bowden, R., 2001. An improved adaptive background mixture model for real-time tracking with shadow detection. In: Proc. of 2nd European Workshop on Advanced Video Based Surveillance Systems.
Kato, J., Joga, S., Rittscher, J., Blake, A., 2002. An HMM-based segmentation method for traffic monitoring movies. IEEE Trans. Pattern Anal. Mach. Intell. 24 (9), 1291–1296.
Lee, D.-S., 2005. Effective Gaussian mixture learning for video background subtraction. IEEE Trans. Pattern Anal. Mach. Intell. 27 (5), 827–832.
Mittal, A., Paragios, N., 2004. Motion-based background subtraction using adaptive kernel density estimation. In: Proc. of the Conf. on Computer Vision and Pattern Recognition.
Monnet, A., Mittal, A., Paragios, N., Ramesh, V., 2003. Background modeling and subtraction of dynamic scenes. In: Proc. of the Internat. Conf. on Computer Vision. pp. 1305–1312.
Pareto, V., 1971. Manual of Political Economy. A.M. Kelley, New York (original in French, 1906).
Power, P.W., Schoonees, J.A., 2002. Understanding background mixture models for foreground segmentation. In: Proc. of the Image and Vision Computing New Zealand.
Prati, A., Mikic, I., Trivedi, M., Cucchiara, R., 2003. Detecting moving shadows: formulation, algorithms and evaluation. IEEE Trans. Pattern Anal. Mach. Intell. 25 (7), 918–924.
Stauffer, C., Grimson, W., 1999. Adaptive background mixture models for real-time tracking. In: Proc. of the Conf. on Computer Vision and Pattern Recognition. pp. 246–252.
Stenger, B., Ramesh, V., Paragios, N., Coetzee, F., Buhmann, J.M., 2001. Topology free hidden Markov models: application to background modeling. In: Proc. of the Internat. Conf. on Computer Vision.
Titterington, D., 1984. Recursive parameter estimation using incomplete data. J. Roy. Statist. Soc., Ser. B (Methodological) 46 (2), 257–267.
Toyama, K., Krumm, J., Brumitt, B., Meyers, B., 1999. Wallflower: principles and practice of background maintenance. In: Proc. of the Internat. Conf. on Computer Vision.
Wand, M., Jones, M., 1995. Kernel Smoothing. Chapman and Hall, London.
Withagen, P.J., Schutte, K., Groen, F., 2002. Likelihood-based object tracking using color histograms and EM. In: Proc. of the Internat. Conf. on Image Processing. pp. 589–592.
Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A., 1997. Pfinder: real-time tracking of the human body. IEEE Trans. Pattern Anal. Mach. Intell. 19 (7), 780–785.
Zhang, Y., 1996. A survey on evaluation methods for image segmentation. Pattern Recognition 29, 1335–1346.
Zivkovic, Z., van der Heijden, F., 2004. Recursive unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 26 (5), 651–656.