Understanding Bag-Of-Words Model: A Statistical Framework


Int. J. Mach. Learn. & Cyber.

(2010) 1:43–52
DOI 10.1007/s13042-010-0001-0

ORIGINAL ARTICLE

Understanding bag-of-words model: a statistical framework


Yin Zhang • Rong Jin • Zhi-Hua Zhou

Received: 27 February 2010 / Accepted: 2 July 2010 / Published online: 28 August 2010
© Springer-Verlag 2010

Y. Zhang · Z.-H. Zhou (✉)
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
e-mail: [email protected]

Y. Zhang
e-mail: [email protected]

R. Jin
Department of Computer Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
e-mail: [email protected]

Abstract The bag-of-words model is one of the most popular representation methods for object categorization. The key idea is to quantize each extracted key point into one of a set of visual words, and then represent each image by a histogram of the visual words. For this purpose, a clustering algorithm (e.g., K-means) is generally used for generating the visual words. Although a number of studies have shown encouraging results of the bag-of-words representation for object categorization, theoretical study of the properties of the bag-of-words model remains almost untouched, possibly due to the difficulty introduced by using a heuristic clustering process. In this paper, we present a statistical framework which generalizes the bag-of-words representation. In this framework, the visual words are generated by a statistical process rather than by a clustering algorithm, while the empirical performance is competitive with clustering-based methods. A theoretical analysis based on statistical consistency is presented for the proposed framework. Moreover, based on the framework we developed two algorithms which do not rely on clustering, while achieving competitive performance in object categorization when compared to clustering-based bag-of-words representations.

Keywords Object recognition · Bag of words model · Rademacher complexity

1 Introduction

Inspired by the success of text categorization [6,10], the bag-of-words representation has become one of the most popular methods for representing image content and has been successfully applied to object categorization. In a typical bag-of-words representation, "interesting" local patches are first identified from an image, either by dense sampling [14,25] or by an interest point detector [9]. These local patches, represented by vectors in a high dimensional space [9], are often referred to as the key points.

To handle these key points efficiently, each extracted key point is quantized into one of a set of visual words, and each image is then represented by a histogram of the visual words. This histogram is often referred to as the bag-of-words representation, and the vector quantization procedure consequently converts the object categorization problem into a text categorization problem. A clustering procedure (e.g., K-means) is often applied to group key points from all the training images into a large number of clusters, with the center of each cluster corresponding to a different visual word. Studies [3,20] have shown promising performance of the bag-of-words representation in object categorization. Various methods [7,8,12,13,17,21,25] have been proposed for the visual vocabulary construction to improve both the computational efficiency and the classification accuracy of object categorization. However, to the best of our knowledge, there is no theoretical analysis on the statistical properties of vector quantization for object categorization.
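The clustering-based pipeline described above can be illustrated with a minimal sketch. It assumes synthetic key points in place of real local descriptors, a plain Lloyd-style K-means written in NumPy, and hypothetical helper names (`kmeans`, `bow_histogram`); it is an illustration of the general idea, not the implementation used in the cited studies.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns k cluster centers (the visual words)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct key points.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(key_points, centers):
    """Assign each key point to its nearest visual word and normalise the counts."""
    d2 = ((key_points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    counts = np.bincount(words, minlength=len(centers))
    return counts / counts.sum()

# Synthetic stand-in for local descriptors: 5 images, 200 key points each, 8 dimensions.
rng = np.random.default_rng(0)
images = [rng.normal(size=(200, 8)) for _ in range(5)]

vocabulary = kmeans(np.vstack(images), k=16)
histograms = np.array([bow_histogram(x, vocabulary) for x in images])
```

Each row of `histograms` is the bag-of-words vector of one image and can be passed to any classifier that accepts fixed-length feature vectors.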


In this paper, we present a statistical framework which generalizes the bag-of-words representation and aims to provide a theoretical understanding of vector quantization and its effect on object categorization from the viewpoint of statistical consistency. In particular, we view

1. each visual word as a quantization function $f_k(x)$ that is randomly sampled from a class of functions $\mathcal{F}$ by an unknown distribution $P_{\mathcal{F}}$; and
2. each key point of an image as a random sample from an unknown distribution $q_i(x)$.

The above statistical description of key points and visual words allows us to interpret the similarity between two images in the bag-of-words representation, the key quantity in object categorization, as an empirical expectation over the distributions $q_i(x)$ and $P_{\mathcal{F}}$. Based on the proposed statistical framework, we present two random algorithms for vector quantization, one based on the empirical distribution and the other based on kernel density estimation. We show that both random algorithms for vector quantization are statistically consistent in estimating the similarity between two images. Our empirical study with object recognition also verifies that the two proposed algorithms (I) yield recognition accuracy that is comparable to the clustering-based bag-of-words representation, and (II) are resilient to the number of visual words when the number of training examples is limited. The success of the two simple algorithms validates the proposed statistical framework for vector quantization.

The rest of this paper is organized as follows. Section 2 presents an overview of existing approaches for key point quantization that have been used in object recognition. Section 3 presents a statistical framework that generalizes the classical bag-of-words representation, and two random algorithms for vector quantization based on the proposed framework. We show that both algorithms are statistically consistent in estimating the similarity between two images. The empirical study with object recognition reported in Sect. 4 shows encouraging results of the proposed algorithms for vector quantization, which in turn validates the proposed statistical framework for the bag-of-words representation. Section 5 concludes this work.

2 Related work

In object recognition and texture analysis, a number of algorithms have been proposed for key point quantization. Among them, K-means is probably the most popular one. To reduce the high computational cost of K-means, hierarchical K-means is proposed in [13] for more efficient vector quantization. In [25], a supervised learning algorithm is proposed to reduce the visual vocabulary that is initially obtained by K-means into a more descriptive and compact one. Farquhar et al. [5] model the problem as a Gaussian mixture model where each visual word corresponds to a Gaussian component, and use the maximum a posteriori (MAP) approach to learn the parameters. A method based on mean-shift is proposed in [7] for vector quantization to resolve the problem that K-means tends to 'starve' medium density regions in feature space, and each key point is allocated to the first visual word similar to it. Moosmann et al. [12] use extremely randomized clustering forests to efficiently generate a highly discriminative coding of visual words. To minimize the loss of information in vector quantization, Lazebnik and Raginsky [8] seek a compressed representation of vectors that preserves the sufficient statistics of features. In [16], images are characterized using a set of category-specific histograms describing whether the content can best be modeled by the universal vocabulary or the specific vocabulary. Tuytelaars and Schmid [21] propose a quantization method that discretizes a feature space by a regular lattice. van Gemert et al. [22] use kernel density estimation to avoid the problems of 'codeword uncertainty' and 'codeword plausibility'.

Although many studies have shown encouraging results of the bag-of-words representation for object categorization, none of them provides a statistical consistency analysis, which reveals the asymptotic behavior of the bag-of-words model for object recognition. Unlike the existing statistical approaches for key point quantization that are designed to reduce the training error, the proposed framework generalizes the bag-of-words model by the statistical expectation, making it possible to analyze the statistical consistency of the bag-of-words model. Finally, we would like to point out that although several randomized approaches [12,14,24] have been proposed for key point quantization, none of them provides a theoretical analysis of statistical consistency. In contrast, we present not only the theoretical results for the two proposed random algorithms for vector quantization, but also the results of the empirical study with object recognition that support the theoretical claim.

3 A statistical framework for bag-of-words representation

In this section, we first present a statistical framework for the bag-of-words representation in object categorization, followed by two random algorithms that are derived from the proposed framework. The analysis of statistical consistency is also presented for the two proposed algorithms.


3.1 A statistical framework

We consider the bag-of-words representation for images, with each image being represented by a collection of local descriptors. We denote by $N$ the number of training images, and by $X_i = (x_i^1, \ldots, x_i^{n_i})$ the collection of key points used to represent image $\mathcal{I}_i$, where $x_i^l \in \mathcal{X}$, $l = 1, \ldots, n_i$, is a key point in feature space $\mathcal{X}$. To facilitate statistical analysis, we assume that each key point $x_i^l$ in $X_i$ is randomly drawn from an unknown distribution $q_i(x)$ associated with image $\mathcal{I}_i$.

The key idea of the bag-of-words representation is to quantize each key point into one of the visual words that are often derived by clustering. We generalize this idea of quantization by viewing the mapping to a visual word $v_k \in \mathcal{X}$ as a quantization function $f_k(x) : \mathcal{X} \mapsto [0, 1]$. Due to the uncertainty in constructing the vocabulary, we assume that the quantization function $f_k(x)$ is randomly drawn from a class of functions, denoted by $\mathcal{F}$, via an unknown distribution $P_{\mathcal{F}}$. To capture the behavior of quantization, we design the function class $\mathcal{F}$ as follows

$$\mathcal{F} = \{ f(x; v) \mid f(x; v) = I(\|x - v\| \le \rho),\ v \in \mathcal{X} \} \tag{1}$$

where the indicator function $I(z)$ outputs 1 when $z$ is true, and 0 otherwise. In the above definition, each quantization function $f(x; v)$ is essentially a ball of radius $\rho$ centered at $v$. It outputs 1 when a point $x$ is within the ball, and 0 if $x$ is outside the ball. This definition of quantization function is clearly related to vector quantization by data clustering.

Based on the above statistical interpretation of key points and quantization functions, we can now provide a statistical description for the histogram of visual words, which is the key of the bag-of-words representation. Let $\hat{h}_i^k$ denote the normalized number of key points in image $\mathcal{I}_i$ that are mapped to visual word $v_k$. Given $m$ visual words, or $m$ quantization functions $\{f_k(x)\}_{k=1}^m$ that are sampled from $\mathcal{F}$, $\hat{h}_i^k$ is computed as

$$\hat{h}_i^k = \frac{1}{n_i} \sum_{l=1}^{n_i} f_k(x_i^l) = \hat{E}_i[f_k(x)] \tag{2}$$

where $\hat{E}_i[f_k(x)]$ stands for the empirical expectation of function $f_k(x)$ based on the samples $x_i^1, \ldots, x_i^{n_i}$. We can generalize the above computation by replacing the empirical expectation $\hat{E}_i[f_k(x)]$ with an expectation over the true distribution $q_i(x)$, i.e.,

$$h_i^k = E_i[f_k(x)] = \int dx\, q_i(x) f_k(x). \tag{3}$$

The bag-of-words representation for image $\mathcal{I}_i$ is expressed by the vector $h_i = (h_i^1, \ldots, h_i^m)$.

In the next step, we analyze the pairwise similarity between two images. It is important to note that the pairwise similarity plays a critical role in any pattern classification problem, including object categorization. According to learning theory [18], it is the pairwise similarity, not the vector representation of images, that decides the classification performance. Using the vector representations $h_i$ and $h_j$, the similarity between two images $\mathcal{I}_i$ and $\mathcal{I}_j$, denoted by $\bar{s}_{ij}$, is computed as

$$\bar{s}_{ij} = \frac{1}{m} h_i^T h_j = \frac{1}{m} \sum_{k=1}^{m} E_i[f_k(x)]\, E_j[f_k(x)] \tag{4}$$

Similar to the previous analysis, the summation in the above expression can be viewed as an empirical expectation over the sampled quantization functions $f_k(x)$, $k = 1, \ldots, m$. We thus generalize the definition of pairwise similarity in Eq. 4 by replacing the empirical expectation with the true expectation, and obtain the true similarity between two images $\mathcal{I}_i$ and $\mathcal{I}_j$ as

$$s_{ij} = E_{f \sim P_{\mathcal{F}}}\big[ E_i[f(x)]\, E_j[f(x)] \big] \tag{5}$$

According to the definition in Eq. 1, each quantization function is parameterized by a center $v$. Thus, to define $P_{\mathcal{F}}$, it suffices to define a distribution for the center $v$, denoted by $q(v)$, and Eq. 5 can be expressed as

$$s_{ij} = E_v\big[ E_i[f(x)]\, E_j[f(x)] \big]. \tag{6}$$

3.2 Random algorithms for key point quantization and their statistical consistency

We emphasize that the pairwise similarity in Eq. 6 cannot be computed directly. This is because both distributions $q_i(x)$ and $q(v)$ are unknown, which makes it intractable to compute $E_i[\cdot]$ and $E_v[\cdot]$. In real applications, approximations are needed. In this section, we study how approximations will affect the estimation of pairwise similarity. In particular, given the pairwise similarity estimated by different kinds of approximated distributions, we aim to bound its difference to the underlying true similarity. To simplify our analysis, we assume that each image has at least $n$ key points.

By assuming that the key points in all the images are sampled from $q(v)$, we have an empirical distribution for $q(v)$, i.e.,

$$\hat{q}(v) = \frac{1}{\sum_{i=1}^{N} n_i} \sum_{i=1}^{N} \sum_{l=1}^{n_i} \delta(v - x_i^l) \tag{7}$$

where $\delta(x)$ is a Dirac delta function satisfying $\int \delta(x)\, dx = 1$ and $\delta(x) = 0$ for $x \ne 0$. Direct estimation of pairwise similarities using the above empirical distribution is computationally expensive, because the number of key points in all images can be very large. In the bag-of-words model, $m$ visual words are used as prototypes for the key


points in all the images. Let $v_1, \ldots, v_m$ be the $m$ visual words randomly sampled from the key points in all the images. The empirical distribution $\hat{q}(v)$ is

$$\hat{q}(v) = \frac{1}{m} \sum_{k=1}^{m} \delta(v - v_k) \tag{8}$$

In the next step, we aim to approximate the unknown distribution $q_i(x)$ in two different ways, and show the statistical consistency of each approximation.

3.2.1 Empirically estimated density function for $q_i(x)$

First we approximate $q_i(x)$ by the empirical distribution $\hat{q}_i(x)$ defined as follows

$$\hat{q}_i(x) = \frac{1}{n_i} \sum_{l=1}^{n_i} \delta(x - x_i^l) \tag{9}$$

Given the approximations for the distributions $q_i(x)$ and $q(v)$, we can now compute the approximation of the pairwise similarity $s_{ij}$ defined in Eq. 6. For Eq. 9, the pairwise similarity, denoted by $\hat{s}_{ij}$, is computed as

$$\hat{s}_{ij} = \hat{E}_v\big[ \hat{E}_i[f(x)]\, \hat{E}_j[f(x)] \big] = \frac{1}{m} \sum_{k=1}^{m} \left( \frac{1}{n_i} \sum_{l=1}^{n_i} I(\|x_i^l - v_k\| \le \rho) \right) \left( \frac{1}{n_j} \sum_{l=1}^{n_j} I(\|x_j^l - v_k\| \le \rho) \right) \tag{10}$$

To show the statistical consistency of $\hat{s}_{ij}$, we need to bound $|s_{ij} - \hat{s}_{ij}|$. Since there are two approximate distributions used in our estimation, we divide our analysis into two steps. First, we measure $|\bar{s}_{ij} - s_{ij}|$, i.e., the difference in similarity caused by the approximate distribution for $P_{\mathcal{F}}$. Next, we measure $|\hat{s}_{ij} - \bar{s}_{ij}|$, i.e., the difference caused by using the approximate distribution for $q_i(x)$. The overall difference $|s_{ij} - \hat{s}_{ij}|$ is bounded by the sum of the two differences.

We first state the McDiarmid inequality [11], which is used throughout our analysis.

Theorem 1 (McDiarmid Inequality) Given independent random variables $v_1, v_2, \ldots, v_n, v_i' \in V$, and a function $f : V^n \mapsto \mathbb{R}$ satisfying

$$\sup_{v_1, v_2, \ldots, v_n, v_i' \in V} |f(v) - f(v')| \le c_i \tag{11}$$

where $v = (v_1, v_2, \ldots, v_n)$ and $v' = (v_1, v_2, \ldots, v_{i-1}, v_i', v_{i+1}, \ldots, v_n)$, then the following statement holds

$$\Pr\big( |f(v) - E(f(v))| \ge \epsilon \big) \le 2 \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right) \tag{12}$$

Using the McDiarmid inequality, we have the following theorem which bounds $|\bar{s}_{ij} - s_{ij}|$.

Theorem 2 Assume that $f_k(x)$, $k = 1, \ldots, m$, are randomly drawn from class $\mathcal{F}$ according to an unknown distribution, and further assume that any function in $\mathcal{F}$ is universally bounded between 0 and 1. With probability $1 - \delta$, the following inequality holds for any two training images $\mathcal{I}_i$ and $\mathcal{I}_j$

$$|\bar{s}_{ij} - s_{ij}| \le \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}} \tag{13}$$

Proof For any $f \in \mathcal{F}$, we have $0 \le E_i[f(x)]\, E_j[f(x)] \le 1$. Thus, for any $k$, $c_k \le 1/m$, and hence $\sum_{k=1}^{m} c_k^2 \le 1/m$. By setting

$$\delta = 2 \exp(-2m\epsilon^2), \quad \text{or} \quad \epsilon = \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}}, \tag{14}$$

we have $\Pr(|\bar{s}_{ij} - s_{ij}| \le \epsilon) \ge 1 - \delta$.

The above theorem indicates that, if we have the true distribution $q_i(x)$ of each image $\mathcal{I}_i$, then with a large number of sampled quantization functions $f_k(x)$, we have a very good chance of recovering the true similarity $s_{ij}$ with a small error. The next theorem bounds $|\hat{s}_{ij} - s_{ij}|$.

Theorem 3 Assume that each image has at least $n$ randomly sampled key points, and that $f_k(x)$, $k = 1, \ldots, m$, are randomly drawn from an unknown distribution over class $\mathcal{F}$. With probability $1 - \delta$, the following inequality is satisfied for any two images $\mathcal{I}_i$ and $\mathcal{I}_j$

$$|\hat{s}_{ij} - s_{ij}| \le \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}} + 2 \sqrt{\frac{1}{2n} \ln \frac{4m^2}{\delta}} \tag{15}$$

Proof We first need to bound the difference between $\hat{E}_i[f_k(x)]$ and $E_i[f_k(x)]$. Since $0 \le f(x) \le 1$ for any $f \in \mathcal{F}$, using the McDiarmid inequality, we have

$$\Pr\big( |\hat{E}_i[f_k(x)] - E_i[f_k(x)]| \ge \epsilon \big) \le 2 \exp(-2n\epsilon^2) \tag{16}$$

By setting

$$2 \exp(-2n\epsilon^2) = \frac{\delta}{2m^2}, \quad \text{or} \quad \epsilon = \sqrt{\frac{1}{2n} \ln \frac{4m^2}{\delta}},$$

with probability $1 - \delta/2$, we have $|\hat{E}_i[f_k(x)] - E_i[f_k(x)]| \le \epsilon$ and $|\hat{E}_j[f_k(x)] - E_j[f_k(x)]| \le \epsilon$ for all $\{f_k(x)\}_{k=1}^{m}$ simultaneously. As a result, with probability $1 - \delta/2$, for any two images $\mathcal{I}_i$ and $\mathcal{I}_j$, we have


$$\begin{aligned}
|\hat{s}_{ij} - \bar{s}_{ij}| &\le \frac{1}{m} \sum_{k=1}^{m} |\hat{E}_i^k \hat{E}_j^k - E_i^k E_j^k| \\
&\le \frac{1}{m} \sum_{k=1}^{m} \Big( |(\hat{E}_i^k - E_i^k)\, \hat{E}_j^k| + |E_i^k (\hat{E}_j^k - E_j^k)| \Big) \\
&\le \frac{1}{m} \sum_{k=1}^{m} \Big( |\hat{E}_i^k - E_i^k| + |\hat{E}_j^k - E_j^k| \Big) \\
&\le 2\epsilon = 2 \sqrt{\frac{1}{2n} \ln \frac{4m^2}{\delta}}
\end{aligned} \tag{17}$$

where $\hat{E}_i^k$ stands for $\hat{E}_i[f_k(x)]$ for simplicity. According to Theorem 2, with probability $1 - \delta/2$, we have

$$|\bar{s}_{ij} - s_{ij}| \le \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}} \tag{18}$$

Combining Eqs. 17 and 18, we have the result in the theorem. With probability $1 - \delta$, the following inequality is satisfied

$$|\hat{s}_{ij} - s_{ij}| \le \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}} + 2 \sqrt{\frac{1}{2n} \ln \frac{4m^2}{\delta}} \tag{19}$$

Remark Theorem 3 reveals an interesting relationship between the estimation error $|s_{ij} - \hat{s}_{ij}|$ and the number of quantization functions (or the number of visual words). The upper bound in Theorem 3 consists of two terms: the first term decreases at a rate of $O(1/\sqrt{m})$ while the second term increases at a rate of $O(\ln m)$. When the number of visual words $m$ is small, the first term dominates the upper bound, and therefore increasing $m$ will reduce the difference $|\hat{s}_{ij} - s_{ij}|$. As $m$ becomes significantly larger than $n$, the second term will dominate the upper bound, and therefore increasing $m$ will lead to a larger $|\hat{s}_{ij} - s_{ij}|$. This result appears to be consistent with the observations on the size of the visual vocabulary: a large vocabulary tends to perform well in object categorization, but too many visual words could deteriorate the classification accuracy.

Finally, we emphasize that although the idea of vector quantization by randomly sampled centers was already discussed in [7,24], to the best of our knowledge, this is the first work that presents its statistical consistency analysis.

3.2.2 Kernel density function estimation for $q_i(x)$

In this section, we approximate $q_i(x)$ by a kernel density estimation. To this end, we assume that the density function $q_i(x)$ belongs to a family of smooth functions $\mathcal{F}_D$ that is defined as follows

$$\mathcal{F}_D = \left\{ q(x) : \mathcal{X} \mapsto \mathbb{R}_+ \ \Big|\ \langle q(x), q(x) \rangle_{\mathcal{H}_\kappa} \le B^2,\ \int q(x)\, dx = 1 \right\} \tag{20}$$

where $\kappa(x, x') : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}_+$ is a local kernel function with $\int \kappa(x, x')\, dx' = 1$. $B$ controls the functional norm of $q(x)$ in the reproducing kernel Hilbert space $\mathcal{H}_\kappa$. An example of $\kappa(x, x')$ is the RBF function, i.e. $\kappa(x, x') \propto \exp(-\lambda\, d(x, x')^2)$, where $d(x, x') = \|x - x'\|_2$. Then, the distribution $q_i(x)$ is approximated by a kernel density estimation $\tilde{q}_i(x)$ defined as follows

$$\tilde{q}_i(x) = \sum_{l=1}^{n_i} \alpha_i^l\, \kappa(x, x_i^l), \tag{21}$$

where $\alpha_i^l$ ($1 \le l \le n_i$) are the combination weights that satisfy (i) $\alpha_i^l \ge 0$, (ii) $\sum_{l=1}^{n_i} \alpha_i^l = 1$, and (iii) $\alpha_i^T K_i \alpha_i \le B^2$, where $K_i = [\kappa(x_i^l, x_i^{l'})]_{n_i \times n_i}$.

Using the kernel density function, we approximate the pairwise similarity for Eq. 21 as follows

$$\tilde{s}_{ij} = \hat{E}_v\big[ \tilde{E}_i[f(x)]\, \tilde{E}_j[f(x)] \big] = \frac{1}{m} \sum_{k=1}^{m} \left( \sum_{l=1}^{n_i} \alpha_i^l\, h(x_i^l, v_k) \right) \left( \sum_{l=1}^{n_j} \alpha_j^l\, h(x_j^l, v_k) \right) \tag{22}$$

where the function $h(x, v)$ is defined as

$$h(x, v) = \int dz\, I(d(z, v) \le \rho)\, \kappa(x, z) \tag{23}$$

To bound the difference between $\tilde{s}_{ij}$ and $s_{ij}$, we follow the analysis in [19] by viewing $E_i[f(x)]\, E_j[f(x)]$ as a mapping, denoted by $g : \mathcal{F} \mapsto \mathbb{R}_+$, i.e.,

$$g(f; q_i, q_j) = E_i[f(x)]\, E_j[f(x)] \tag{24}$$

The domain for function $g$, denoted by $\mathcal{G}$, is defined as

$$\mathcal{G} = \big\{ g : \mathcal{F} \mapsto \mathbb{R}_+ \ \big|\ \exists\, q_i, q_j \in \mathcal{F}_D \ \text{s.t.}\ g(f) = E_i[f(x)]\, E_j[f(x)] \big\} \tag{25}$$

To bound the complexity of a class of functions, we introduce the concept of Rademacher complexity [2]:

Definition 1 (Rademacher complexity) Suppose $x_1, \ldots, x_n$ are sampled i.i.d. from a set $X$. Let $\mathcal{F}$ be a class of functions mapping from $X$ to $\mathbb{R}$. The Rademacher complexity of $\mathcal{F}$ is defined as

$$R_n(\mathcal{F}) = E_{x_1, \ldots, x_n, \sigma} \left[ \sup_{f \in \mathcal{F}} \left( \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right) \right] \tag{26}$$

where the $\sigma_i$ are independent uniform $\pm 1$-valued random variables.

Assuming at least $n$ key points are randomly sampled from each image, we have the following lemma that bounds the complexity of domain $\mathcal{G}$:

Lemma 1 The Rademacher complexity of function class $\mathcal{G}$, denoted by $R_m(\mathcal{G})$, is bounded as


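$$R_m(\mathcal{G}) \le 2BC_\kappa\, \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}} \tag{27}$$

where $C_\kappa = \max_{x,z} \sqrt{\kappa(x, z)}$.

Proof Denote $F = \{f_1, \ldots, f_m\}$. According to the definition, we have

$$\begin{aligned}
R_m(\mathcal{G}) &= E_{\sigma, F} \left[ \sup_{g \in \mathcal{G}} \frac{2}{m} \sum_{k=1}^{m} \sigma_k\, g(f_k) \right] \\
&= E_F \left[ E_\sigma \left[ \sup_{g \in \mathcal{G}} \frac{2}{m} \sum_{k=1}^{m} \sigma_k\, g(f_k) \,\Big|\, F \right] \right] \\
&= E_F \left[ E_\sigma \left[ \sup_{q_i, q_j \in \mathcal{F}_D} \frac{2}{m} \sum_{k=1}^{m} \sigma_k\, E_i[f_k]\, E_j[f_k] \,\Big|\, F \right] \right] \\
&\le E_F \left[ E_\sigma \left[ \sup_{\|\omega_i\| \le B} \frac{2}{m} \sum_{k=1}^{m} \sigma_k\, E_i[f_k] \,\Big|\, F \right] \right] \\
&= E_F \left[ E_\sigma \left[ \sup_{\|\omega_i\| \le B} \frac{2}{m} \Big\langle \omega_i, \sum_{k=1}^{m} \sigma_k \Phi_k \Big\rangle \,\Big|\, F \right] \right] \\
&\le \frac{2B}{m}\, E_F \left[ E_\sigma \left[ \Big\| \sum_{k=1}^{m} \sigma_k \Phi_k \Big\| \,\Big|\, F \right] \right] \\
&= \frac{2B}{m}\, E_F \left[ E_\sigma \left[ \Big( \sum_{k,t} \sigma_k \sigma_t \langle \Phi_k, \Phi_t \rangle \Big)^{1/2} \,\Big|\, F \right] \right] \\
&\le \frac{2B}{m}\, E_F \left[ \Big( \sum_{k,t} E_\sigma\big[ \sigma_k \sigma_t \langle \Phi_k, \Phi_t \rangle \,\big|\, F \big] \Big)^{1/2} \right] \\
&= \frac{2B}{m}\, E_F \left[ \Big( \sum_{k} E_\sigma\big[ \sigma_k^2 \langle \Phi_k, \Phi_k \rangle \,\big|\, F \big] \Big)^{1/2} \right] \\
&= \frac{2B}{m}\, E_F \left[ \Big( \sum_{k} \langle \Phi_k, \Phi_k \rangle \Big)^{1/2} \right] \\
&= \frac{2B}{m}\, E_F \left[ \Big( \sum_{k} \int dz\, dx\, f_k(x) f_k(z)\, \kappa(x, z) \Big)^{1/2} \right] \\
&\le 2BC_\kappa\, \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}}
\end{aligned} \tag{28}$$

where $\Phi_k = (\langle \phi_1(\cdot), f_k(\cdot) \rangle, \langle \phi_2(\cdot), f_k(\cdot) \rangle, \ldots)$ and $\phi_k(x)$ is an eigenfunction of $\kappa(x, x')$. The first inequality holds because $E_j[f_k] \le 1$; the second inequality is from Cauchy's inequality; the third and fourth inequalities are from Jensen's inequality. The last equality follows from

$$\langle \Phi_k, \Phi_k \rangle = \sum_i \langle \phi_i(\cdot), f_k(\cdot) \rangle^2 = \int dz\, dx\, f_k(x) f_k(z)\, \kappa(x, z) \tag{29}$$

From [2], we have the following lemmas:

Lemma 2 (Theorem 12 in [2]) For $q \in [1, \infty)$, let $\mathcal{L} = \{ |f - h|^q : f \in \mathcal{F} \}$, where $h$ and $\|f - h\|_\infty$ are uniformly bounded. We have

$$R_n(\mathcal{L}) \le 2q\, \|f - h\|_\infty \left( R_n(\mathcal{F}) + \frac{\|h\|_\infty}{\sqrt{n}} \right) \tag{30}$$

Lemma 3 (Theorem 8 in [2]) With probability $1 - \delta$ the following inequality holds

$$E\, \phi(Y, f(X)) \le \hat{E}_n\, \phi(Y, f(X)) + R_n(\tilde{\phi} \circ \mathcal{F}) + \sqrt{\frac{8 \ln(2/\delta)}{n}} \tag{31}$$

where $\phi(x, y)$ is the loss function, $n$ is the number of samples and $\tilde{\phi} \circ \mathcal{F} = \{ (x, y) \mapsto \phi(y, f(x)) - \phi(y, 0) : f \in \mathcal{F} \}$.

Based on the above lemmas, we have the following theorem.

Theorem 4 Assume that the density functions $q_i(x), q_j(x) \in \mathcal{F}_D$. Let $\tilde{q}_i(x), \tilde{q}_j(x) \in \mathcal{F}_D$ be the density functions estimated from $n$ sampled key points. With probability $1 - \delta$, the following inequality holds

$$\begin{aligned}
E_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] \le{}& \hat{E}_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; \hat{q}_i, \hat{q}_j)| \big] \\
&+ 2 \left( 2BC_\kappa\, \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) \\
&+ \sqrt{\frac{\ln(8/\delta)}{2m}} + 2 \sqrt{\frac{\ln(8m^2/\delta)}{2n}}
\end{aligned} \tag{32}$$

Proof From Lemma 3, with probability $1 - \delta/2$, we have

$$E_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] \le \hat{E}_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] + R_m\big( |\mathcal{G} - g(f; q_i, q_j)| \big) + \sqrt{\frac{8 \ln(4/\delta)}{m}} \tag{33}$$

Since $0 \le g(f; q_i, q_j) \le 1$, using the results in Lemmas 1 and 2, we have

$$R_m\big( |\mathcal{G} - g(f; q_i, q_j)| \big) \le 2 \left( R_m(\mathcal{G}) + \frac{1}{\sqrt{m}} \right) \le 2 \left( 2BC_\kappa\, \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) \tag{34}$$

Hence, with probability $1 - \delta/2$, the following inequality holds

$$E_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] \le \hat{E}_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] + 2 \left( 2BC_\kappa\, \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) + \sqrt{\frac{8 \ln(4/\delta)}{m}} \tag{35}$$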


Next, we aim to bound $\hat{E}_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big]$. Note that

$$\begin{aligned}
\hat{E}_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] &= \frac{1}{m} \sum_{k=1}^{m} |g(f_k; \tilde{q}_i, \tilde{q}_j) - g(f_k; q_i, q_j)| \\
&\le \frac{1}{m} \sum_{k=1}^{m} \Big( |g(f_k; \tilde{q}_i, \tilde{q}_j) - g(f_k; \hat{q}_i, \hat{q}_j)| + |g(f_k; \hat{q}_i, \hat{q}_j) - g(f_k; q_i, q_j)| \Big)
\end{aligned} \tag{36}$$

Using the same logic as in the proof of Theorem 3, we have, with probability $1 - \delta/2$,

$$\frac{1}{m} \sum_{k=1}^{m} |g(f_k; \hat{q}_i, \hat{q}_j) - g(f_k; q_i, q_j)| \le \sqrt{\frac{\ln(8/\delta)}{2m}} + 2 \sqrt{\frac{\ln(8m^2/\delta)}{2n}} \tag{37}$$

From the above results, with probability $1 - \delta/2$, the following inequality holds

$$\frac{1}{m} \sum_{k=1}^{m} |g(f_k; \tilde{q}_i, \tilde{q}_j) - g(f_k; q_i, q_j)| \le \frac{1}{m} \sum_{k=1}^{m} |g(f_k; \tilde{q}_i, \tilde{q}_j) - g(f_k; \hat{q}_i, \hat{q}_j)| + \sqrt{\frac{\ln(8/\delta)}{2m}} + 2 \sqrt{\frac{\ln(8m^2/\delta)}{2n}} \tag{38}$$

Combining the above results together, with probability $1 - \delta$, the following inequality holds

$$\begin{aligned}
E_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] \le{}& \hat{E}_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; \hat{q}_i, \hat{q}_j)| \big] \\
&+ 2 \left( 2BC_\kappa\, \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) \\
&+ \sqrt{\frac{\ln(8/\delta)}{2m}} + 2 \sqrt{\frac{\ln(8m^2/\delta)}{2n}}
\end{aligned} \tag{39}$$

In our empirical study, we will use the RBF kernel function for $\kappa(x, x')$ with $\alpha_i^l = 1/n_i$. The corollary below shows the bound for this choice of kernel density estimation.

Corollary 5 When the kernel function is $\kappa(x, x') = (1/(2\pi\sigma^2))^{d/2} \exp(-\|x - x'\|_2^2 / (2\sigma^2))$ and $\alpha_i^l = 1/n_i$, the bound in Theorem 4 becomes

$$\begin{aligned}
E_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] \le{}& \left( \frac{1}{2\pi\sigma^2} \right)^{d/2} \left( 1 - \exp\left( -\frac{\rho^2}{2\sigma^2} \right) \right) \\
&+ 2\, \frac{2 E_f\left[ \int dx\, |f(x)| \right] / \sqrt{n_i} + 1}{\sqrt{m}} \\
&+ \sqrt{\frac{\ln(8/\delta)}{2m}} + 2 \sqrt{\frac{\ln(8m^2/\delta)}{2n}}
\end{aligned} \tag{40}$$

Remark Theorem 4 bounds the true expectation of the difference between the similarity estimated by the kernel density function and the true similarity. Similar to Theorem 3, this bound also consists of a term decreasing at a rate of $O(1/\sqrt{m})$ and a term increasing at a rate of $O(\ln m)$. Moreover, in order to minimize the true expectation of the difference between the similarity estimated by the kernel density function and the true similarity, we need to minimize the empirical expectation of the difference between the similarity estimated by the kernel density function and the similarity estimated by the empirical density function. If $\kappa(x, v)$ decreases exponentially as $d(x, v)$ increases, as with a Gaussian kernel, $h(x, v)$ is close to 1 when $d(x, v) \le \rho$ and close to 0 when $d(x, v) > \rho$. In such circumstances, setting $\alpha_i^l = 1/n_i$ for all $1 \le l \le n_i$ is a good choice for the approximation and is also very efficient since we do not need to learn $\alpha$.

Note that although the idea of kernel density estimation was already proposed in some studies [e.g., 22], to the best of our knowledge, this is the first work that reveals the statistical consistency of kernel density estimation for the bag-of-words representation.

4 Empirical study

In this empirical study, we aim to verify the proposed framework and the related analysis. To this end, based on the discussion in Sect. 3.2, we present two random algorithms for vector quantization that are shown in Algorithm 1. We refer to the algorithm based on the empirical distribution as "Quantization via Empirical Estimation", or QEE for short, and to the algorithm based on kernel density estimation as "Quantization via Kernel Estimation", or QKE for short. Note that since neither vector quantization algorithm relies on a clustering algorithm to identify visual words, they are in general computationally more efficient. In addition, both algorithms have error bounds that decrease at the rate of $O(1/\sqrt{m})$ when the number of key points $n$ is large, indicating that they are robust to the number of visual words $m$. We emphasize that although similar random algorithms for vector quantization have been discussed in [5,17,22,14,24], the purpose of this empirical study is to verify that

• simple random algorithms deliver similar performance of object recognition as the clustering-based algorithm, and
• the random algorithms are robust to the number of visual words, as predicted by the statistical consistency analysis.


Finally, in the implementation of QKE, to efficiently calculate the $h$-function, we approximate it as [1]

$$h \approx \frac{2(\tilde{d} - \tilde{\rho})^2 - 1}{4\sqrt{\pi}\, (\tilde{d} - \tilde{\rho})^3 \exp\big( (\tilde{d} - \tilde{\rho})^2 \big)} - \frac{2(\tilde{d} + \tilde{\rho})^2 - 1}{4\sqrt{\pi}\, (\tilde{d} + \tilde{\rho})^3 \exp\big( (\tilde{d} + \tilde{\rho})^2 \big)} \tag{41}$$

where $\tilde{d} = d/\sigma$, $\tilde{\rho} = \rho/\sigma$ and $\sigma$ is the width of the Gaussian kernel.

Two data sets are used in our study: the PASCAL VOC Challenge 2006 data set [4] and the Graz02 data set [15]. PASCAL06 contains 5,304 images from 10 classes. We randomly select 100 images for training and 500 for testing. The Graz02 data set contains 365 bike images, 420 car images, 311 people images and 380 background images. We randomly select 100 images from each class for training, and use the remaining images for testing. By using a relatively small number of examples for training, we are able to examine the sensitivity of a vector quantization algorithm to the number of visual words. On average 1,000 key points are extracted from each image, and each key point is represented by the SIFT local descriptor [23]. For the PASCAL06 data set, the binary classification performance for each object class is measured by the area under the ROC curve (AUC). For the Graz02 data set, the binary classification performance for each object class is measured by the accuracy. Results averaged over ten random trials are reported.

We compare three vector quantization methods: K-means, QEE and QKE. Note that we do not include more advanced algorithms for vector quantization in our study because the objective of this study is to validate the proposed statistical framework for the bag-of-words representation and the analysis of statistical consistency. The threshold $\rho$ used by the quantization functions $f(x)$ is set as $\rho = 0.5 \times \bar{d}$, where $\bar{d}$ is the average distance between all the key points and the randomly selected centers. An RBF kernel is used in QKE, with the kernel width $\sigma$ set as $0.75\,\bar{d}$ according to our experience. A binary linear SVM is used for each classification problem. To examine the sensitivity to the number of visual words, for both data sets, we varied the number of visual words from 10 to 10,000, as shown in Figs. 1 and 2.

First, we observe that the proposed algorithms for vector quantization yield comparable if not better performance than the K-means clustering algorithm. This confirms that the proposed statistical framework for key point quantization is effective. Second, we observe that the clustering-based approach for vector quantization tends to perform worse, sometimes very significantly, when the number of visual words is large. We attribute this instability to the fact that K-means requires each interest point to belong to exactly one visual word. If the number of clusters is not appropriate, for example, too large compared to the number of instances, two relevant key points may be separated into different clusters although they are both very near to the boundary. This leads to a poor estimation of pairwise similarity. The problem of "hard assignment" was also observed in [17,22]. In contrast, for the proposed algorithms, we observe a rather stable improvement as the number of visual words increases, consistent with our analysis of statistical consistency.

5 Conclusion

The bag-of-words model is one of the most popular representation methods for object categorization. The key idea is to quantize each extracted key point into one of a set of visual words, and then represent each image by a histogram of the visual words. For this purpose, a clustering algorithm (e.g., K-means) is generally used for generating the visual words. Although a number of studies have shown encouraging results of the bag-of-words representation for object categorization, theoretical study of the properties of the bag-of-words model remains almost untouched, possibly due to the difficulty introduced by using a heuristic clustering process. In this paper, we present a statistical framework which generalizes the bag-of-words representation. In this framework, the visual words are generated by a statistical process rather than by a clustering algorithm, while the empirical performance is competitive with clustering-based methods. A theoretical analysis based on statistical consistency is presented for the proposed framework. Moreover, based on the framework we developed two algorithms which do not rely on clustering, while achieving competitive performance


[Figure 1: ten panels, one per object class (bicycle, bus, car, cat, cow, dog, horse, motor, person, sheep), each plotting AUC against the number of visual words (10 to 10,000) for K-means, QEE and QKE.]
Fig. 1 Comparison of different quantization methods with varied number of visual words on PASCAL06
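
The degradation of K-means at large vocabulary sizes is attributed in the text to hard assignment: each key point is mapped to exactly one visual word. The following is a minimal sketch of that effect, not the implementation used in the experiments. The toy key points, the hand-picked codebooks `coarse` and `fine`, and the use of histogram intersection as the similarity measure are all assumptions made for illustration.

```python
import math

def nearest_word(point, centers):
    """Index of the closest center (hard assignment to one visual word)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))

def bow_histogram(key_points, centers):
    """Normalized histogram of visual-word counts for one image."""
    counts = [0] * len(centers)
    for p in key_points:
        counts[nearest_word(p, centers)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def histogram_intersection(h1, h2):
    """Illustrative similarity between two histograms (1.0 means identical)."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Two toy "images" with one key point each; the points are only 0.02 apart.
image_a = [(0.49, 0.0)]
image_b = [(0.51, 0.0)]

# Hypothetical codebooks (hand-picked, not learned by K-means).
coarse = [(0.5, 0.0)]             # one visual word: both points share it
fine = [(0.0, 0.0), (1.0, 0.0)]   # the boundary at x = 0.5 separates them

sim_coarse = histogram_intersection(
    bow_histogram(image_a, coarse), bow_histogram(image_b, coarse))
sim_fine = histogram_intersection(
    bow_histogram(image_a, fine), bow_histogram(image_b, fine))
print(sim_coarse, sim_fine)  # prints: 1.0 0.0
```

With the finer codebook the two nearly identical key points fall on opposite sides of the cell boundary, so the estimated similarity drops from 1.0 to 0.0 even though the points have barely moved.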

in object categorization when compared to clustering-based
bag-of-words representations.
Bag-of-words representation is a popular approach to object
categorization. Despite its success, few studies are devoted to the
theoretic analysis of the bag-of-words representation. In this work,
we present a statistical framework for key point quantization that
generalizes the bag-of-words model by statistical expectation. We
present two random algorithms for vector quantization where the
visual words are generated by a statistical process rather than using
a clustering algorithm. A theoretical analysis of their statistical
consistency is presented. We also verify the efficacy and the
robustness of the proposed framework by applying it to object
recognition. In the future, we plan to examine the dependence of the
proposed algorithms on the threshold q, and extend QKE to weighted
kernel density estimation.

[Figure 2: accuracy (0.4 to 0.6) plotted against the number of visual words (10 to 10,000) for K-means, QEE and QKE.]
Fig. 2 Comparison of different quantization methods with varying number of visual words on Graz02

Acknowledgments We want to thank the reviewers for helpful comments
and suggestions. This research is partially supported by the National
Fundamental Research Program of China (2010CB327903), the Jiangsu 333
High-Level Talent Cultivation Program and the National Science
Foundation (IIS-0643494). Any opinions, findings and conclusions or
recommendations expressed in this material are those of the authors
and do not necessarily reflect the views of the funding agencies.

References

1. Abramowitz M, Stegun IA (eds) (1972) Handbook of mathematical functions with formulas, graphs, and mathematical tables. Dover, New York


2. Bartlett PL, Wang M (2002) Rademacher and Gaussian complexities: risk bounds and structural results. J Mach Learn Res 3:463–482
3. Csurka G, Dance C, Fan L, Williamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: ECCV workshop on statistical learning in computer vision, Prague, Czech Republic, 2004
4. Everingham M, Zisserman A, Williams CKI, Van Gool L (2006) The PASCAL visual object classes challenge 2006 (VOC2006) results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
5. Farquhar J, Szedmak S, Meng H, Shawe-Taylor J (2005) Improving "bag-of-keypoints" image categorisation. Technical report, University of Southampton
6. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning. Chemnitz, Germany, pp 137–142
7. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: Proceedings of the 10th IEEE international conference on computer vision, Beijing, China, 2005, pp 604–610
8. Lazebnik S, Raginsky M (2009) Supervised learning of quantizer codebooks by information loss minimization. IEEE Trans Pattern Anal Mach Intell 31(7):1294–1309
9. Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
10. McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In: AAAI workshop on learning for text categorization, Madison, WI
11. McDiarmid C (1989) On the method of bounded differences. In: Surveys in combinatorics 1989, pp 148–188
12. Moosmann F, Triggs B, Jurie F (2007) Fast discriminative visual codebooks using randomized clustering forests. In: Schölkopf B, Platt J, Hoffman T (eds) Advances in neural information processing systems, vol 19. MIT Press, Cambridge, pp 985–992
13. Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, New York, NY, pp 2161–2168
14. Nowak E, Jurie F, Triggs B (2006) Sampling strategies for bag-of-features image classification. In: Proceedings of the 9th European conference on computer vision, Graz, Austria, pp 490–503
15. Opelt A, Pinz A, Fussenegger M, Auer P (2006) Generic object recognition with boosting. IEEE Trans Pattern Anal Mach Intell 28(3):416–431
16. Perronnin F, Dance C, Csurka G, Bressian M (2006) Adapted vocabularies for generic visual categorization. In: Proceedings of the 9th European conference on computer vision, Graz, Austria, pp 464–475
17. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, Anchorage, AK
18. Schölkopf B, Smola AJ (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
19. Shawe-Taylor J, Dolia A (2007) A framework for probability density estimation. In: Proceedings of the 11th international conference on artificial intelligence and statistics, San Juan, Puerto Rico, pp 468–475
20. Sivic J, Zisserman A (2003) Video Google: A text retrieval approach to object matching in videos. In: Proceedings of the 9th IEEE international conference on computer vision, Nice, France, pp 1470–1477
21. Tuytelaars T, Schmid C (2007) Vector quantizing feature space with a regular lattice. In: Proceedings of the 11th IEEE international conference on computer vision, Rio de Janeiro, Brazil, pp 1–8
22. van Gemert JC, Geusebroek J-M, Veenman CJ, Smeulders AWM (2008) Kernel codebooks for scene categorization. In: Proceedings of the 10th European conference on computer vision, Marseille, France, pp 696–709
23. Vedaldi A, Fulkerson B (2008) VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/
24. Viitaniemi V, Laaksonen J (2008) Experiments on selection of codebooks for local image feature histograms. In: Proceedings of the 10th international conference series on visual information systems, Salerno, Italy, pp 126–137
25. Winn J, Criminisi A, Minka T (2005) Object categorization by learned universal visual dictionary. In: Proceedings of the 10th IEEE international conference on computer vision, Beijing, China, pp 1800–1807
