Understanding Bag-Of-Words Model: A Statistical Framework
(2010) 1:43–52
DOI 10.1007/s13042-010-0001-0
ORIGINAL ARTICLE
Received: 27 February 2010 / Accepted: 2 July 2010 / Published online: 28 August 2010
Springer-Verlag 2010
Abstract The bag-of-words model is one of the most popular representation methods for object categorization. The key idea is to quantize each extracted key point into one of the visual words, and then represent each image by a histogram of the visual words. For this purpose, a clustering algorithm (e.g., K-means) is generally used for generating the visual words. Although a number of studies have shown encouraging results of the bag-of-words representation for object categorization, theoretical studies on the properties of the bag-of-words model are almost untouched, possibly due to the difficulty introduced by using a heuristic clustering process. In this paper, we present a statistical framework which generalizes the bag-of-words representation. In this framework, the visual words are generated by a statistical process rather than by a clustering algorithm, while the empirical performance is competitive with clustering-based methods. A theoretical analysis based on statistical consistency is presented for the proposed framework. Moreover, based on the framework we develop two algorithms which do not rely on clustering, while achieving competitive performance in object categorization when compared to clustering-based bag-of-words representations.

Keywords Object recognition · Bag of words model · Rademacher complexity

Y. Zhang · Z.-H. Zhou (✉)
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
e-mail: [email protected]

Y. Zhang
e-mail: [email protected]

R. Jin
Department of Computer Science & Engineering, Michigan State University, East Lansing, MI 48824, USA
e-mail: [email protected]

1 Introduction

Inspired by the success of text categorization [6,10], the bag-of-words representation has become one of the most popular methods for representing image content and has been successfully applied to object categorization. In a typical bag-of-words representation, "interesting" local patches are first identified from an image, either by dense sampling [14,25] or by an interest point detector [9]. These local patches, represented by vectors in a high dimensional space [9], are often referred to as the key points.

To handle these key points efficiently, the key idea is to quantize each extracted key point into one of the visual words, and then represent each image by a histogram of the visual words. This vector quantization procedure yields what is often referred to as the bag-of-words representation, and consequently converts the object categorization problem into a text categorization problem. A clustering procedure (e.g., K-means) is often applied to group key points from all the training images into a large number of clusters, with the center of each cluster corresponding to a different visual word. Studies [3,20] have shown promising performance of the bag-of-words representation in object categorization. Various methods [7,8,12,13,17,21,25] have been proposed for visual vocabulary construction to improve both the computational efficiency and the classification accuracy of object categorization. However, to the best of our knowledge, there is no theoretical analysis of the statistical properties of vector quantization for object categorization.
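The clustering-based pipeline described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the studies cited: the key points are random toy vectors rather than real local descriptors, K-means is a plain Lloyd iteration written out in NumPy, and the names `kmeans`, `bow_histogram`, `images` and `vocabulary` are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns a (k, d) array of cluster centers."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct key points.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(key_points, centers):
    """Hard-assign each key point to its nearest visual word and count."""
    d2 = ((key_points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    counts = np.bincount(labels, minlength=len(centers))
    return counts / counts.sum()  # normalised histogram

rng = np.random.default_rng(0)
# Toy data (assumption): 5 "images", each with 200 key points in 8 dimensions.
images = [rng.normal(size=(200, 8)) for _ in range(5)]
vocabulary = kmeans(np.vstack(images), k=10)          # 10 visual words
histograms = np.array([bow_histogram(x, vocabulary) for x in images])
```

Each row of `histograms` is the bag-of-words vector of one image; any classifier that accepts fixed-length vectors can then be trained on these rows.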
44 Int. J. Mach. Learn. & Cyber. (2010) 1:43–52
In this paper, we present a statistical framework which generalizes the bag-of-words representation and aims to provide a theoretical understanding of vector quantization and its effect on object categorization from the viewpoint of statistical consistency. In particular, we view

1. each visual word as a quantization function $f_k(x)$ that is randomly sampled from a class of functions $\mathcal{F}$ by an unknown distribution $P_{\mathcal{F}}$; and
2. each key point of an image as a random sample from an unknown distribution $q_i(x)$.

The above statistical description of key points and visual words allows us to interpret the similarity between two images in the bag-of-words representation, the key quantity in object categorization, as an empirical expectation over the distributions $q_i(x)$ and $P_{\mathcal{F}}$. Based on the proposed statistical framework, we present two random algorithms for vector quantization, one based on the empirical distribution and the other based on kernel density estimation. We show that both random algorithms for vector quantization are statistically consistent in estimating the similarity between two images. Our empirical study with object recognition also verifies that the two proposed algorithms (I) yield recognition accuracy that is comparable to the clustering based bag-of-words representation, and (II) are resilient to the number of visual words when the number of training examples is limited. The success of the two simple algorithms validates the proposed statistical framework for vector quantization.

The rest of this paper is organized as follows. Section 2 presents an overview of existing approaches for key point quantization that have been used in object recognition. Section 3 presents a statistical framework that generalizes the classical bag-of-words representation, and two random algorithms for vector quantization based on the proposed framework. We show that both algorithms are statistically consistent in estimating the similarity between two images. The empirical study with object recognition reported in Sect. 4 shows encouraging results of the proposed algorithms for vector quantization, which in turn validates the proposed statistical framework for the bag-of-words representation. Section 5 concludes this work.

algorithm is proposed to reduce the visual vocabulary that is initially obtained by K-means, into a more descriptive and compact one. Farquhar et al. [5] model the problem as a Gaussian mixture model where each visual word corresponds to a Gaussian component, and use the maximum a posteriori (MAP) approach to learn the parameters. A method based on mean-shift is proposed in [7] for vector quantization to resolve the problem that K-means tends to 'starve' medium density regions in feature space, and each key point is allocated to the first visual word similar to it. Moosmann et al. [12] use extremely randomized clustering forests to efficiently generate a highly discriminative coding of visual words. To minimize the loss of information in vector quantization, Lazebnik and Raginsky [8] seek a compressed representation of vectors that preserves the sufficient statistics of features. In [16], images are characterized using a set of category-specific histograms describing whether the content can best be modeled by the universal vocabulary or the specific vocabulary. Tuytelaars and Schmid [21] propose a quantization method that discretizes a feature space by a regular lattice. van Gemert et al. [22] use kernel density estimation to avoid the problems of 'codeword uncertainty' and 'codeword plausibility'.

Although many studies have shown encouraging results of the bag-of-words representation for object categorization, none of them provides a statistical consistency analysis, which reveals the asymptotic behavior of the bag-of-words model for object recognition. Unlike the existing statistical approaches for key point quantization that are designed to reduce the training error, the proposed framework generalizes the bag-of-words model by the statistical expectation, making it possible to analyze the statistical consistency of the bag-of-words model. Finally, we would like to point out that although several randomized approaches [12,14,24] have been proposed for key point quantization, none of them provides a theoretical analysis of statistical consistency. In contrast, we present not only the theoretical results for the two proposed random algorithms for vector quantization, but also the results of the empirical study with object recognition that support the theoretical claim.
3.1 A statistical framework

We consider the bag-of-words representation for images, with each image being represented by a collection of local descriptors. We denote by $N$ the number of training images, and by $X_i = (x_i^1, \ldots, x_i^{n_i})$ the collection of key points used to represent image $\mathcal{I}_i$, where $x_i^l \in \mathcal{X}$, $l = 1, \ldots, n_i$, is a key point in feature space $\mathcal{X}$. To facilitate statistical analysis, we assume that each key point $x_i^l$ in $X_i$ is randomly drawn from an unknown distribution $q_i(x)$ associated with image $\mathcal{I}_i$.

The key idea of the bag-of-words representation is to quantize each key point into one of the visual words that are often derived by clustering. We generalize this idea of quantization by viewing the mapping to a visual word $v_k \in \mathcal{X}$ as a quantization function $f_k(x) : \mathcal{X} \mapsto [0, 1]$. Due to the uncertainty in constructing the vocabulary, we assume that the quantization function $f_k(x)$ is randomly drawn from a class of functions, denoted by $\mathcal{F}$, via an unknown distribution $P_{\mathcal{F}}$. To capture the behavior of quantization, we design the function class $\mathcal{F}$ as follows

$$\mathcal{F} = \{ f(x; v) \mid f(x; v) = I(\|x - v\| \le \rho),\ v \in \mathcal{X} \} \tag{1}$$

where the indicator function $I(z)$ outputs 1 when $z$ is true, and 0 otherwise. In the above definition, each quantization function $f(x; v)$ is essentially a ball of radius $\rho$ centered at $v$. It outputs 1 when a point $x$ is within the ball, and 0 if $x$ is outside the ball. This definition of quantization function is clearly related to vector quantization by data clustering.

Based on the above statistical interpretation of key points and quantization functions, we can now provide a statistical description for the histogram of visual words, which is the key of the bag-of-words representation. Let $\hat{h}_i^k$ denote the normalized number of key points in image $\mathcal{I}_i$ that are mapped to visual word $v_k$. Given $m$ visual words, or $m$ quantization functions $\{f_k(x)\}_{k=1}^m$ that are sampled from $\mathcal{F}$, $\hat{h}_i^k$ is computed as

$$\hat{h}_i^k = \frac{1}{n_i} \sum_{l=1}^{n_i} f_k(x_i^l) = \hat{E}_i[f_k(x)] \tag{2}$$

where $\hat{E}_i[f_k(x)]$ stands for the empirical expectation of function $f_k(x)$ based on the samples $x_i^1, \ldots, x_i^{n_i}$. We can generalize the above computation by replacing the empirical expectation $\hat{E}_i[f_k(x)]$ with an expectation over the true distribution $q_i(x)$, i.e.,

$$h_i^k = E_i[f_k(x)] = \int dx\, q_i(x) f_k(x). \tag{3}$$

The bag-of-words representation for image $\mathcal{I}_i$ is expressed by the vector $h_i = (h_i^1, \ldots, h_i^m)$.

In the next step, we analyze the pairwise similarity between two images. It is important to note that the pairwise similarity plays a critical role in any pattern classification problem, including object categorization. According to learning theory [18], it is the pairwise similarity, not the vector representation of images, that decides the classification performance. Using the vector representations $h_i$ and $h_j$, the similarity between two images $\mathcal{I}_i$ and $\mathcal{I}_j$, denoted by $\bar{s}_{ij}$, is computed as

$$\bar{s}_{ij} = \frac{1}{m} h_i^T h_j = \frac{1}{m} \sum_{k=1}^m E_i[f_k(x)]\, E_j[f_k(x)] \tag{4}$$

Similar to the previous analysis, the summation in the above expression can be viewed as an empirical expectation over the sampled quantization functions $f_k(x)$, $k = 1, \ldots, m$. We thus generalize the definition of pairwise similarity in Eq. 4 by replacing the empirical expectation with the true expectation, and obtain the true similarity between two images $\mathcal{I}_i$ and $\mathcal{I}_j$ as

$$s_{ij} = E_{f \sim P_{\mathcal{F}}}\big[ E_i[f(x)]\, E_j[f(x)] \big] \tag{5}$$

According to the definition in Eq. 1, each quantization function is parameterized by a center $v$. Thus, to define $P_{\mathcal{F}}$, it suffices to define a distribution for the center $v$, denoted by $q(v)$. Eq. 5 can then be expressed as

$$s_{ij} = E_v\big[ E_i[f(x)]\, E_j[f(x)] \big]. \tag{6}$$

3.2 Random algorithms for key point quantization and their statistical consistency

We emphasize that the pairwise similarity in Eq. 6 cannot be computed directly. This is because both distributions $q_i(x)$ and $q(v)$ are unknown, which makes it intractable to compute $E_i[\cdot]$ and $E_v[\cdot]$. In real applications, approximations are needed. In this section, we study how approximations affect the estimation of pairwise similarity. In particular, given the pairwise similarity estimated by different kinds of approximated distributions, we aim to bound its difference from the underlying true similarity. To simplify our analysis, we assume that each image has at least $n$ key points.

By assuming that the key points in all the images are sampled from $q(v)$, we have an empirical distribution for $q(v)$, i.e.,

$$\hat{q}(v) = \frac{1}{\sum_{i=1}^N n_i} \sum_{i=1}^N \sum_{l=1}^{n_i} \delta(v - x_i^l) \tag{7}$$

where $\delta(x)$ is a Dirac delta function satisfying $\int \delta(x)\, dx = 1$ and $\delta(x) = 0$ for $x \ne 0$. Direct estimation of pairwise similarities using the above empirical distribution is computationally expensive, because the number of key points in all images can be very large. In the bag-of-words model, m visual words are used as prototypes for the key
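The quantities in Eqs. 1, 2, 4 and 7 can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the authors' algorithm: the key points are toy Gaussian samples, the centers are drawn uniformly with replacement from the pooled key points (one way of sampling from the empirical distribution of Eq. 7), the radius `rho` is an arbitrary value, and the similarity is the plug-in version of Eq. 4 with empirical histograms in place of the true expectations. All function and variable names are hypothetical.

```python
import numpy as np

def ball_indicators(key_points, centers, rho):
    """f(x; v) = I(||x - v|| <= rho) for every key point x and center v (Eq. 1)."""
    dist = np.linalg.norm(key_points[:, None, :] - centers[None, :, :], axis=2)
    return (dist <= rho).astype(float)            # shape (n_i, m)

def histogram(key_points, centers, rho):
    """Empirical expectation of each f_k over one image's key points (Eq. 2)."""
    return ball_indicators(key_points, centers, rho).mean(axis=0)

def similarity(h_i, h_j):
    """(1/m) h_i^T h_j, the form of Eq. 4 applied to empirical histograms."""
    return float(h_i @ h_j) / len(h_i)

rng = np.random.default_rng(0)
# Toy data (assumption): images 0 and 1 share a distribution, image 2 is shifted.
images = [rng.normal(loc=c, size=(300, 4)) for c in (0.0, 0.0, 3.0)]
pooled = np.vstack(images)

m = 50       # number of quantization functions (visual words)
rho = 1.5    # ball radius, an arbitrary illustrative value
# Sample centers from the empirical distribution of all key points (Eq. 7).
centers = pooled[rng.choice(len(pooled), size=m)]

H = [histogram(x, centers, rho) for x in images]
s_01 = similarity(H[0], H[1])   # two images with the same key point distribution
s_02 = similarity(H[0], H[2])   # two images with well separated distributions
```

Because the balls overlap, a key point may fall into several of them or into none, so the entries of a histogram need not sum to one. This is the main practical difference from hard assignment to the nearest cluster center.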
$$\sup_{v_1, v_2, \ldots, v_n, v_i' \in V} |f(\mathbf{v}) - f(\mathbf{v}')| \le c_i \tag{11}$$

where $\mathbf{v} = (v_1, v_2, \ldots, v_n)$ and $\mathbf{v}' = (v_1, v_2, \ldots, v_{i-1}, v_i', v_{i+1}, \ldots, v_n)$, then the following statement holds

with probability $1 - \delta/2$, we have $|\hat{E}_i[f_k(x)] - E_i[f_k(x)]| \le \epsilon$ and $|\hat{E}_j[f_k(x)] - E_j[f_k(x)]| \le \epsilon$ for all $\{f_k(x)\}_{k=1}^m$ simultaneously. As a result, with probability $1 - \delta/2$, for any two images $\mathcal{I}_i$ and $\mathcal{I}_j$, we have
where $\hat{E}_i^k$ stands for $\hat{E}_i[f_k(x)]$ for simplicity. According to Theorem 2, with probability $1 - \delta/2$, we have

$$|\bar{s}_{ij} - s_{ij}| \le \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}} \tag{18}$$

Combining Eqs. 17 and 18, we have the result in the theorem. With probability $1 - \delta$, the following inequality is satisfied

$$|\hat{s}_{ij} - s_{ij}| \le \sqrt{\frac{1}{2m} \ln \frac{2}{\delta}} + 2 \sqrt{\frac{1}{2n} \ln \frac{4m^2}{\delta}} \tag{19}$$

Remark Theorem 3 reveals an interesting relationship between the estimation error $|s_{ij} - \hat{s}_{ij}|$ and the number of quantization functions (or the number of visual words). The upper bound in Theorem 3 consists of two terms: the first term decreases at a rate of $O(1/\sqrt{m})$ while the second term increases at a rate of $O(\ln m)$. When the number of visual words $m$ is small, the first term dominates the upper bound, and therefore increasing $m$ will reduce the difference $|\hat{s}_{ij} - s_{ij}|$. As $m$ becomes significantly larger than $n$, the second term will dominate the upper bound, and therefore increasing $m$ will lead to a larger $|\hat{s}_{ij} - s_{ij}|$. This result appears to be consistent with the observations on the size of the visual vocabulary: a large vocabulary tends to perform well in object categorization, but too many visual words could deteriorate the classification accuracy.

Finally, we emphasize that although the idea of vector quantization by randomly sampled centers was already discussed in [7,24], to the best of our knowledge, this is the first work that presents its statistical consistency analysis.

3.2.2 Kernel density function estimation for $q_i(x)$

In this section, we approximate $q_i(x)$ by a kernel density estimation. To this end, we assume that the density function $q_i(x)$ belongs to a family of smooth functions $\mathcal{F}_D$ that is defined as follows

$$\mathcal{F}_D = \left\{ q(x) : \mathcal{X} \mapsto \mathbb{R}_+ \,\middle|\, \langle q(x), q(x) \rangle_{\mathcal{H}_\kappa} \le B^2,\ \int q(x)\, dx = 1 \right\} \tag{20}$$

where $\alpha_i^l$ $(1 \le l \le n_i)$ are the combination weights that satisfy (i) $\alpha_i^l \ge 0$, (ii) $\sum_{l=1}^{n_i} \alpha_i^l = 1$, and (iii) $\alpha_i^T K_i \alpha_i \le B^2$, where $K_i = [\kappa(x_i^l, x_i^{l'})]_{n_i \times n_i}$.

Using the kernel density function, we approximate the pairwise similarity for Eq. 21 as follows

$$\tilde{s}_{ij} = \hat{E}_v\big[ \tilde{E}_i[f(x)]\, \tilde{E}_j[f(x)] \big] = \frac{1}{m} \sum_{k=1}^m \left( \sum_{l=1}^{n_i} \alpha_i^l\, h(x_i^l, v_k) \right) \left( \sum_{l=1}^{n_j} \alpha_j^l\, h(x_j^l, v_k) \right) \tag{22}$$

where function $h(x, v)$ is defined as

$$h(x, v) = \int dz\, I(d(z, v) \le \rho)\, \kappa(x, z) \tag{23}$$

To bound the difference between $\tilde{s}_{ij}$ and $s_{ij}$, we follow the analysis in [19] by viewing $E_i[f(x)]\, E_j[f(x)]$ as a mapping, denoted by $g : \mathcal{F} \mapsto \mathbb{R}_+$, i.e.,

$$g(f; q_i, q_j) = E_i[f(x)]\, E_j[f(x)] \tag{24}$$

The domain for function $g$, denoted by $\mathcal{G}$, is defined as

$$\mathcal{G} = \left\{ g : \mathcal{F} \mapsto \mathbb{R}_+ \,\middle|\, \exists\, q_i, q_j \in \mathcal{F}_D \ \text{s.t.}\ g(f) = E_i[f(x)]\, E_j[f(x)] \right\} \tag{25}$$

To bound the complexity of a class of functions, we introduce the concept of Rademacher complexity [2]:

Definition 1 (Rademacher complexity) Suppose $x_1, \ldots, x_n$ are sampled i.i.d. from a set $\mathcal{X}$. Let $\mathcal{F}$ be a class of functions mapping from $\mathcal{X}$ to $\mathbb{R}$. The Rademacher complexity of $\mathcal{F}$ is defined as

$$R_n(\mathcal{F}) = E_{x_1, \ldots, x_n, \sigma} \left[ \sup_{f \in \mathcal{F}} \frac{2}{n} \sum_{i=1}^n \sigma_i f(x_i) \right] \tag{26}$$

where the $\sigma_i$ are independent uniform $\pm 1$-valued random variables.

Assuming at least $n$ key points are randomly sampled from each image, we have the following lemma that bounds the complexity of domain $\mathcal{G}$:

Lemma 1 The Rademacher complexity of function class $\mathcal{G}$, denoted by $R_m(\mathcal{G})$, is bounded as
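$$R_m(\mathcal{G}) \le 2 B C_\kappa \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}} \tag{27}$$

where $C_\kappa = \sqrt{\max_{x,z} \kappa(x, z)}$.

Proof Denote $F = \{f_1, \ldots, f_m\}$. According to the definition, we have

$$
\begin{aligned}
R_m(\mathcal{G}) &= E_{\sigma, F}\left[ \sup_{g \in \mathcal{G}} \frac{2}{m} \sum_{k=1}^m \sigma_k g(f_k) \right] \\
&= E_F\left[ E_\sigma\left[ \sup_{g \in \mathcal{G}} \frac{2}{m} \sum_{k=1}^m \sigma_k g(f_k) \,\middle|\, F \right] \right] \\
&= E_F\left[ E_\sigma\left[ \sup_{q_i, q_j \in \mathcal{F}_D} \frac{2}{m} \sum_{k=1}^m \sigma_k E_i[f_k]\, E_j[f_k] \,\middle|\, F \right] \right] \\
&\le E_F\left[ E_\sigma\left[ \sup_{\|\omega_i\| \le B} \frac{2}{m} \sum_{k=1}^m \sigma_k E_i[f_k] \,\middle|\, F \right] \right] \\
&= E_F\left[ E_\sigma\left[ \sup_{\|\omega_i\| \le B} \frac{2}{m} \left\langle \omega_i, \sum_{k=1}^m \sigma_k \Phi_k \right\rangle \,\middle|\, F \right] \right] \\
&\le \frac{2B}{m} E_F\left[ E_\sigma\left[ \left\| \sum_{k=1}^m \sigma_k \Phi_k \right\| \,\middle|\, F \right] \right] \\
&= \frac{2B}{m} E_F\left[ E_\sigma\left[ \left( \sum_{k,t} \sigma_k \sigma_t \langle \Phi_k, \Phi_t \rangle \right)^{1/2} \,\middle|\, F \right] \right] \\
&\le \frac{2B}{m} E_F\left[ \left( \sum_{k,t} E_\sigma\left[ \sigma_k \sigma_t \langle \Phi_k, \Phi_t \rangle \,\middle|\, F \right] \right)^{1/2} \right] \\
&= \frac{2B}{m} E_F\left[ \left( E_\sigma\left[ \sum_k \sigma_k^2 \langle \Phi_k, \Phi_k \rangle \,\middle|\, F \right] \right)^{1/2} \right] \\
&= \frac{2B}{m} E_F\left[ \left( \sum_k \langle \Phi_k, \Phi_k \rangle \right)^{1/2} \right] \\
&= \frac{2B}{m} E_F\left[ \left( \sum_k \int dz\, dx\, f_k(x) f_k(z) \kappa(x, z) \right)^{1/2} \right] \\
&\le 2 B C_\kappa \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}}
\end{aligned}
\tag{28}
$$

where $\Phi_k = (\langle \phi_1(\cdot), f_k(\cdot) \rangle, \langle \phi_2(\cdot), f_k(\cdot) \rangle, \ldots)$ and $\phi_k(x)$ is an eigenfunction of $\kappa(x, x')$. The first inequality holds because $E_j[f_k] \le 1$; the second inequality is from Cauchy's inequality; the third and fourth inequalities are from Jensen's inequality. The last equality follows from

$$\langle \Phi_k, \Phi_k \rangle = \sum_i \langle \phi_i(\cdot), f_k(\cdot) \rangle^2 = \int dz\, dx\, f_k(x) f_k(z) \kappa(x, z) \tag{29}$$

From [2], we have the following lemmas:

Lemma 2 (Theorem 12 in [2]) For $1 \le q < \infty$, let $\mathcal{L} = \{ |f - h|^q : f \in \mathcal{F} \}$, where $h$ and $\|f - h\|_\infty$ are uniformly bounded. We have

$$R_n(\mathcal{L}) \le 2q \|f - h\|_\infty \left( R_n(\mathcal{F}) + \frac{\|h\|_\infty}{\sqrt{n}} \right) \tag{30}$$

Lemma 3 (Theorem 8 in [2]) With probability $1 - \delta$ the following inequality holds

$$E\phi(Y, f(X)) \le \hat{E}_n \phi(Y, f(X)) + R_n(\tilde{\phi} \circ \mathcal{F}) + \sqrt{\frac{8 \ln(2/\delta)}{n}} \tag{31}$$

where $\phi(x, y)$ is the loss function, $n$ is the number of samples and $\tilde{\phi} \circ \mathcal{F} = \{ (x, y) \mapsto \phi(y, f(x)) - \phi(y, 0) : f \in \mathcal{F} \}$.

Based on the above lemmas, we have the following theorem

Theorem 4 Assume that the density functions $q_i(x), q_j(x) \in \mathcal{F}_D$. Let $\tilde{q}_i(x), \tilde{q}_j(x) \in \mathcal{F}_D$ be estimated density functions from $n$ sampled key points. Then, with probability $1 - \delta$, the following inequality holds

$$
\begin{aligned}
E_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] \le{}& \hat{E}_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; \hat{q}_i, \hat{q}_j)| \big] \\
&+ 2 \left( 2 B C_\kappa \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) + \sqrt{\frac{\ln(8/\delta)}{2m}} + 2 \sqrt{\frac{\ln(8m^2/\delta)}{2n}}
\end{aligned}
\tag{32}
$$

Proof From Lemma 3, with probability $1 - \delta/2$, we have

$$E_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] \le \hat{E}_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] + R_m\big( |\mathcal{G} - g(f; q_i, q_j)| \big) + \sqrt{\frac{8 \ln(4/\delta)}{m}} \tag{33}$$

Since $0 \le g(f; q_i, q_j) \le 1$, using the results in Lemmas 1 and 2, we have

$$R_m\big( |\mathcal{G} - g(f; q_i, q_j)| \big) \le 2 \left( R_m(\mathcal{G}) + \frac{1}{\sqrt{m}} \right) \le 2 \left( 2 B C_\kappa \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) \tag{34}$$

Hence, with probability $1 - \delta/2$, the following inequality holds

$$E_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] \le \hat{E}_f\big[ |g(f; \tilde{q}_i, \tilde{q}_j) - g(f; q_i, q_j)| \big] + 2 \left( 2 B C_\kappa \frac{E_f\left[ \int dx\, |f(x)| \right]}{\sqrt{m}} + \frac{1}{\sqrt{m}} \right) + \sqrt{\frac{8 \ln(4/\delta)}{m}} \tag{35}$$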
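The Rademacher complexity used in the lemmas above can be made concrete with a small Monte Carlo estimate. The sketch below rests on several simplifying assumptions and is not part of the paper's analysis: it holds the sample $x_1, \ldots, x_n$ fixed (so it estimates the empirical counterpart of Eq. 26, averaging only over the signs $\sigma$), it replaces the supremum over an infinite class by a maximum over a finite set of ball indicator functions, and the centers, radius and data are toy values. The names `empirical_rademacher` and `ball_values` are hypothetical.

```python
import numpy as np

def empirical_rademacher(F_values, n_draws=1000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_f (2/n) sum_i sigma_i f(x_i) ].

    F_values has shape (num_functions, n): row k holds f_k(x_1), ..., f_k(x_n).
    """
    rng = np.random.default_rng(seed)
    n = F_values.shape[1]
    # Independent uniform +/-1 signs, one row per Monte Carlo draw.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # Entry (d, k) is (2/n) * sum_i sigma[d, i] * f_k(x_i).
    corr = (2.0 / n) * sigma @ F_values.T
    # Maximum over the finite class, then average over sign draws.
    return float(corr.max(axis=1).mean())

def ball_values(points, centers, rho=1.0):
    """Evaluate I(||x - v|| <= rho) for each center v (rows) and point x (columns)."""
    dist = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=2)
    return (dist <= rho).astype(float)

rng = np.random.default_rng(1)
centers = rng.normal(size=(20, 2))   # 20 toy quantization functions
r_small = empirical_rademacher(ball_values(rng.normal(size=(50, 2)), centers))
r_large = empirical_rademacher(ball_values(rng.normal(size=(1000, 2)), centers))
```

With the class held fixed, the estimate for the larger sample is noticeably smaller than for the smaller one, which is the qualitative behaviour that complexity-based bounds of this kind rely on.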
Finally, in the implementation of QKE, to efficiently calculate the $h$-function, we approximate it as [1]

$$h \approx \frac{2(\tilde{d} - \tilde{\rho})^2 - 1}{4 \sqrt{\pi}\, (\tilde{d} - \tilde{\rho})^3 \exp\big( (\tilde{d} - \tilde{\rho})^2 \big)} - \frac{2(\tilde{d} + \tilde{\rho})^2 - 1}{4 \sqrt{\pi}\, (\tilde{d} + \tilde{\rho})^3 \exp\big( (\tilde{d} + \tilde{\rho})^2 \big)} \tag{41}$$

where $\tilde{d} = d/\sigma$, $\tilde{\rho} = \rho/\sigma$ and $\sigma$ is the width of the Gaussian kernel.

Two data sets are used in our study: the PASCAL VOC Challenge 2006 data set [4] and the Graz02 data set [15]. PASCAL06 contains 5,304 images from 10 classes. We randomly select 100 images for training and 500 for testing. The Graz02 data set contains 365 bike images, 420 car images, 311 people images and 380 background images. We randomly select 100 images from each class for training, and use the remaining images for testing. By using a relatively small number of examples for training, we are able to examine the sensitivity of a vector quantization algorithm to the number of visual words. On average 1,000 key points are extracted from each image, and each key point is represented by the SIFT local descriptor [23]. For the PASCAL06 data set, the binary classification performance for each object class is measured by the area under the ROC curve (AUC). For the Graz02 data set, the binary classification performance for each object class is measured by the accuracy. Results averaged over ten random trials are reported.

We compare three vector quantization methods: K-means, QEE and QKE. Note that we do not include more advanced algorithms for vector quantization in our study because the objective of this study is to validate the proposed statistical framework for the bag-of-words representation and the analysis of statistical consistency. The threshold $\rho$ used by the quantization functions $f(x)$ is set as $\rho = 0.5\, \bar{d}$, where $\bar{d}$ is the average distance between all the key points and the randomly selected centers. An RBF kernel is used in QKE, with the kernel width $\sigma$ set to $0.75\, \bar{d}$ according to our experience. A binary linear SVM is used for each classification problem. To examine the sensitivity to the number of visual words, for both data sets, we varied the number of visual words from 10 to 10,000, as shown in Figs. 1 and 2.

First, we observe that the proposed algorithms for vector quantization yield comparable if not better performance than the K-means clustering algorithm. This confirms that the proposed statistical framework for key point quantization is effective. Second, we observe that the clustering based approach for vector quantization tends to perform worse, sometimes very significantly, when the number of visual words is large. We attribute this instability to the fact that K-means requires each interest point to belong to exactly one visual word. If the number of clusters is not appropriate, for example, too large compared to the number of instances, two relevant key points may be separated into different clusters although they are both very near to the boundary. This leads to a poor estimation of pairwise similarity. The problem of "hard assignment" was also observed in [17,22]. In contrast, for the proposed algorithms, we observe a rather stable improvement as the number of visual words increases, consistent with our analysis of statistical consistency.

5 Conclusion

The bag-of-words model is one of the most popular representation methods for object categorization. The key idea is to quantize each extracted key point into one of the visual words, and then represent each image by a histogram of the visual words. For this purpose, a clustering algorithm (e.g., K-means) is generally used for generating the visual words. Although a number of studies have shown encouraging results of the bag-of-words representation for object categorization, theoretical studies on the properties of the bag-of-words model are almost untouched, possibly due to the difficulty introduced by using a heuristic clustering process. In this paper, we present a statistical framework which generalizes the bag-of-words representation. In this framework, the visual words are generated by a statistical process rather than by a clustering algorithm, while the empirical performance is competitive with clustering-based methods. A theoretical analysis based on statistical consistency is presented for the proposed framework. Moreover, based on the framework we developed two algorithms which do not rely on clustering, while achieving competitive performance
in object categorization when compared to clustering-based bag-of-words representations.

Bag-of-words representation is a popular approach to object categorization. Despite its success, few studies are devoted to the theoretic analysis of the bag-of-words representation. In this work, we present a statistical framework for key point quantization that generalizes the bag-of-words model by statistical expectation. We present two random algorithms for vector quantization where the visual words are generated by a statistical process rather than using a clustering algorithm. A theoretical analysis of their statistical consistency is presented. We also verify the efficacy and the robustness of the proposed framework by applying it to object recognition. In the future, we plan to examine the dependence of the proposed algorithms on the threshold $\rho$, and extend QKE to weighted kernel density estimation.

[Figure 1: AUC versus number of visual words (10 to 10,000) for K-means, QEE and QKE, one panel per object class; surviving panel titles: cow, dog, horse, motor, person, sheep]
Fig. 1 Comparison of different quantization methods with varied number of visual words on PASCAL06

[Figure 2: accuracy versus number of visual words (10 to 10,000) for K-means, QEE and QKE]
Fig. 2 Comparison of different quantization methods with varying number of visual words on Graz02

Acknowledgments We want to thank the reviewers for helpful comments and suggestions. This research is partially supported by the National Fundamental Research Program of China (2010CB327903), the Jiangsu 333 High-Level Talent Cultivation Program and the National Science Foundation (IIS-0643494). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

1. Abramowitz M, Stegun IA (eds) (1972) Handbook of mathematical functions with formulas, graphs, and mathematical tables. Dover, New York
2. Bartlett PL, Mendelson S (2002) Rademacher and Gaussian complexities: risk bounds and structural results. J Mach Learn Res 3:463–482
3. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: ECCV workshop on statistical learning in computer vision, Prague, Czech Republic, 2004
4. Everingham M, Zisserman A, Williams CKI, Van Gool L (2006) The PASCAL visual object classes challenge 2006 (VOC2006) results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
5. Farquhar J, Szedmak S, Meng H, Shawe-Taylor J (2005) Improving "bag-of-keypoints" image categorisation. Technical report, University of Southampton
6. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning, Chemnitz, Germany, pp 137–142
7. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. In: Proceedings of the 10th IEEE international conference on computer vision, Beijing, China, 2005, pp 604–610
8. Lazebnik S, Raginsky M (2009) Supervised learning of quantizer codebooks by information loss minimization. IEEE Trans Pattern Anal Mach Intell 31(7):1294–1309
9. Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
10. McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: AAAI workshop on learning for text categorization, Madison, WI
11. McDiarmid C (1989) On the method of bounded differences. In: Surveys in combinatorics 1989, pp 148–188
12. Moosmann F, Triggs B, Jurie F (2007) Fast discriminative visual codebooks using randomized clustering forests. In: Schölkopf B, Platt J, Hoffman T (eds) Advances in neural information processing systems, vol 19. MIT Press, Cambridge, pp 985–992
13. Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, New York, NY, pp 2161–2168
14. Nowak E, Jurie F, Triggs B (2006) Sampling strategies for bag-of-features image classification. In: Proceedings of the 9th European conference on computer vision, Graz, Austria, pp 490–503
15. Opelt A, Pinz A, Fussenegger M, Auer P (2006) Generic object recognition with boosting. IEEE Trans Pattern Anal Mach Intell 28(3):416–431
16. Perronnin F, Dance C, Csurka G, Bressan M (2006) Adapted vocabularies for generic visual categorization. In: Proceedings of the 9th European conference on computer vision, Graz, Austria, pp 464–475
17. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: improving particular object retrieval in large scale image databases. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, Anchorage, AK
18. Schölkopf B, Smola AJ (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
19. Shawe-Taylor J, Dolia A (2007) A framework for probability density estimation. In: Proceedings of the 11th international conference on artificial intelligence and statistics, San Juan, Puerto Rico, pp 468–475
20. Sivic J, Zisserman A (2003) Video Google: a text retrieval approach to object matching in videos. In: Proceedings of the 9th IEEE international conference on computer vision, Nice, France, pp 1470–1477
21. Tuytelaars T, Schmid C (2007) Vector quantizing feature space with a regular lattice. In: Proceedings of the 11th IEEE international conference on computer vision, Rio de Janeiro, Brazil, pp 1–8
22. van Gemert JC, Geusebroek J-M, Veenman CJ, Smeulders AWM (2008) Kernel codebooks for scene categorization. In: Proceedings of the 10th European conference on computer vision, Marseille, France, pp 696–709
23. Vedaldi A, Fulkerson B (2008) VLFeat: an open and portable library of computer vision algorithms. http://www.vlfeat.org/
24. Viitaniemi V, Laaksonen J (2008) Experiments on selection of codebooks for local image feature histograms. In: Proceedings of the 10th international conference series on visual information systems, Salerno, Italy, pp 126–137
25. Winn J, Criminisi A, Minka T (2005) Object categorization by learned universal visual dictionary. In: Proceedings of the 10th IEEE international conference on computer vision, Beijing, China, pp 1800–1807