Shape Matching and Object Recognition Using Shape Contexts
Abstract: We present a novel approach to measuring similarity between shapes and exploit it for object recognition. In our
framework, the measurement of similarity is preceded by 1) solving for correspondences between points on the two shapes, 2) using
the correspondences to estimate an aligning transform. In order to solve the correspondence problem, we attach a descriptor, the
shape context, to each point. The shape context at a reference point captures the distribution of the remaining points relative to it, thus
offering a globally discriminative characterization. Corresponding points on two similar shapes will have similar shape contexts,
enabling us to solve for correspondences as an optimal assignment problem. Given the point correspondences, we estimate the
transformation that best aligns the two shapes; regularized thin-plate splines provide a flexible class of transformation maps for this
purpose. The dissimilarity between the two shapes is computed as a sum of matching errors between corresponding points, together
with a term measuring the magnitude of the aligning transform. We treat recognition in a nearest-neighbor classification framework as
the problem of finding the stored prototype shape that is maximally similar to that in the image. Results are presented for silhouettes,
trademarks, handwritten digits, and the COIL data set.
Index Terms: Shape, object recognition, digit recognition, correspondence problem, MPEG-7, image registration, deformable templates.
1 INTRODUCTION
Fig. 2. Example of coordinate transformations relating two fish, from D'Arcy Thompson's On Growth and Form [55]. Thompson observed that similar
biological forms could be related by means of simple mathematical transformations between homologous (i.e., corresponding) features. Examples of
homologous features include center of eye, tip of dorsal fin, etc.
BELONGIE ET AL.: SHAPE MATCHING AND OBJECT RECOGNITION USING SHAPE CONTEXTS 511
Silhouettes are fundamentally limited as shape descriptors for general objects; they ignore internal contours and are difficult to extract from real images. More promising are approaches that treat the shape as a set of points in the 2D image. Extracting these from an image is less of a problem, e.g., one can just use an edge detector. Huttenlocher et al. developed methods in this category based on the Hausdorff distance [23]; this can be extended to deal with partial matching and clutter. A drawback for our purposes is that the method does not return correspondences. Methods based on Distance Transforms, such as [16], are similar in spirit and behavior in practice.

The work of Sclaroff and Pentland [50] is representative of the eigenvector- or modal-matching based approaches; see also [52], [51], [57]. In this approach, sample points in the image are cast into a finite element spring-mass model and correspondences are found by comparing modes of vibration. Most closely related to our approach is the work of Gold et al. [19] and Chui and Rangarajan [9], which is discussed in Section 3.4.

There have been several approaches to shape recognition based on spatial configurations of a small number of keypoints or landmarks. In geometric hashing [32], these configurations are used to vote for a model without explicitly solving for correspondences. Amit et al. [1] train decision trees for recognition by learning discriminative spatial configurations of keypoints. Leung et al. [35], Schmid and Mohr [49], and Lowe [36] additionally use gray-level information at the keypoints to provide greater discriminative power. It should be noted that not all objects have distinguished key points (think of a circle, for instance), and using key points alone sacrifices the shape information available in smooth portions of object contours.

2.2 Brightness-Based Methods
Brightness-based (or appearance-based) methods offer a complementary view to feature-based methods. Instead of focusing on the shape of the occluding contour or other extracted features, these approaches make direct use of the gray values within the visible portion of the object. One can use brightness information in one of two frameworks.

In the first category, we have the methods that explicitly find correspondences/alignment using grayscale values. Yuille [61] presents a very flexible approach in that invariance to certain kinds of transformations can be built into the measure of model similarity, but it suffers from the need for human-designed templates and from sensitivity to initialization when searching via gradient descent. Lades et al. [31] use elastic graph matching, an approach that involves both geometry and photometric features in the form of local descriptors based on Gaussian derivative jets. Vetter et al. [59] and Cootes et al. [10] compare brightness values, but first attempt to warp the images onto one another using a dense correspondence field.

The second category includes those methods that build classifiers without explicitly finding correspondences. In such approaches, one relies on a learning algorithm having enough examples to acquire the appropriate invariances. In the area of face recognition, good results were obtained using principal components analysis (PCA) [54], [56], particularly when used in a probabilistic framework [38]. Murase and Nayar applied these ideas to 3D object recognition [40]. Several authors have applied discriminative classification methods in the appearance-based shape matching framework. Some examples are the LeNet classifier [34], a convolutional neural network for handwritten digit recognition, and the Support Vector Machine (SVM)-based methods of [41] (for discriminating between templates of pedestrians based on 2D wavelet coefficients) and [11], [7] (for handwritten digit recognition). The MNIST database of handwritten digits is a particularly important data set, as many different pattern recognition algorithms have been tested on it. We will show our results on MNIST in Section 6.1.

3 MATCHING WITH SHAPE CONTEXTS
In our approach, we treat an object as a (possibly infinite) point set and we assume that the shape of an object is essentially captured by a finite subset of its points. More practically, a shape is represented by a discrete set of points sampled from the internal or external contours on the object. These can be obtained as locations of edge pixels as found by an edge detector, giving us a set P = {p_1, ..., p_n}, p_i ∈ R², of n points. They need not, and typically will not, correspond to key-points such as maxima of curvature or inflection points. We prefer to sample the shape with roughly uniform spacing, though this is also not critical.¹ Figs. 3a and 3b show sample points for two shapes. Assuming contours are piecewise smooth, we can obtain as good an approximation to the underlying continuous shapes as desired by picking n to be sufficiently large.

3.1 Shape Context
For each point p_i on the first shape, we want to find the "best" matching point q_j on the second shape. This is a correspondence problem similar to that in stereopsis. Experience there suggests that matching is easier if one uses a rich local descriptor, e.g., a gray-scale window or a vector of filter outputs [27], instead of just the brightness at a single pixel or edge location. Rich descriptors reduce the ambiguity in matching.

As a key contribution, we propose a novel descriptor, the shape context, that could play such a role in shape matching. Consider the set of vectors originating from a point to all other sample points on a shape. These vectors express the configuration of the entire shape relative to the reference point. Obviously, this set of n − 1 vectors is a rich description, since as n gets large, the representation of the shape becomes exact.

The full set of vectors as a shape descriptor is much too detailed, since shapes and their sampled representation may vary from one instance to another in a category. We identify the distribution over relative positions as a more robust and compact, yet highly discriminative descriptor. For a point p_i on the shape, we compute a coarse histogram h_i of the relative coordinates of the remaining n − 1 points,

h_i(k) = #{q ≠ p_i : (q − p_i) ∈ bin(k)}.   (1)

1. Sampling considerations are discussed in Appendix B.
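To make (1) concrete, the log-polar histogram of Section 3.1 can be sketched as follows. The 5 × 12 bin layout and the radial range follow the counts quoted in the paper; the exact edge placement and the exclusion of points beyond the outer radius are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def shape_contexts(points, n_r=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    """Compute a log-polar shape context histogram for each point.

    points: (n, 2) array of sampled edge locations.
    Returns an (n, n_r * n_theta) array of histograms, one per point.
    Radii are measured in units of the mean pairwise distance (scale
    invariance); pairs falling outside [r_inner, r_outer] are ignored.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]      # vectors q - p_i
    dist = np.hypot(diff[..., 0], diff[..., 1])
    dist = dist / dist[dist > 0].mean()                 # normalize by mean distance
    theta = np.arctan2(diff[..., 1], diff[..., 0])      # angle of each vector

    # 5 log-spaced radial edges and 12 uniform angular bins.
    r_edges = np.logspace(np.log10(r_inner), np.log10(r_outer), n_r + 1)
    r_bin = np.searchsorted(r_edges, dist) - 1          # -1: below r_inner
    t_bin = ((theta + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta

    hists = np.zeros((n, n_r * n_theta))
    for i in range(n):
        for j in range(n):
            if i != j and 0 <= r_bin[i, j] < n_r:
                hists[i, r_bin[i, j] * n_theta + t_bin[i, j]] += 1
    return hists
```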
512 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 4, APRIL 2002
Fig. 3. Shape context computation and matching. (a) and (b) Sampled edge points of two shapes. (c) Diagram of log-polar histogram bins used in computing the shape contexts. We use five bins for log r and 12 bins for θ. (d), (e), and (f) Example shape contexts for reference samples marked by ○, ⋄, ◁ in (a) and (b). Each shape context is a log-polar histogram of the coordinates of the rest of the point set measured using the reference point as the origin. (Dark = large value.) Note the visual similarity of the shape contexts for ○ and ⋄, which were computed for relatively similar points on the two shapes. By contrast, the shape context for ◁ is quite different. (g) Correspondences found using bipartite matching, with costs defined by the χ² distance between histograms.
This histogram is defined to be the shape context of p_i. We use bins that are uniform in log-polar² space, making the descriptor more sensitive to positions of nearby sample points than to those of points farther away. An example is shown in Fig. 3c.

Consider a point p_i on the first shape and a point q_j on the second shape. Let C_ij = C(p_i, q_j) denote the cost of matching these two points. As shape contexts are distributions represented as histograms, it is natural to use the χ² test statistic:

C_ij = C(p_i, q_j) = (1/2) Σ_{k=1}^{K} [h_i(k) − h_j(k)]² / [h_i(k) + h_j(k)],

where h_i(k) and h_j(k) denote the K-bin normalized histograms at p_i and q_j, respectively.³

The cost C_ij for matching points can include an additional term based on the local appearance similarity at points p_i and q_j. This is particularly useful when we are comparing shapes derived from gray-level images instead of line drawings. For example, one can add a cost based on normalized correlation scores between small gray-scale patches centered at p_i and q_j, distances between vectors of filter outputs at p_i and q_j, tangent orientation difference between p_i and q_j, and so on. The choice of this appearance similarity term is application dependent, and is driven by the necessary invariance and robustness requirements; e.g., varying lighting conditions make reliance on gray-scale brightness values risky.

3.2 Bipartite Graph Matching
Given the set of costs C_ij between all pairs of points p_i on the first shape and q_j on the second shape, we want to minimize the total cost of matching,

H(π) = Σ_i C(p_i, q_{π(i)}),   (2)

subject to the constraint that the matching be one-to-one, i.e., π is a permutation. This is an instance of the square assignment (or weighted bipartite matching) problem, which can be solved in O(N³) time using the Hungarian method [42]. In our experiments, we use the more efficient algorithm of [28]. The input to the assignment problem is a square cost matrix with entries C_ij. The result is a permutation π(i) such that (2) is minimized.

In order to have robust handling of outliers, one can add "dummy" nodes to each point set with a constant matching cost of ε_d. In this case, a point will be matched to a "dummy" whenever there is no real match available at smaller cost than ε_d. Thus, ε_d can be regarded as a threshold parameter for outlier detection. Similarly, when the number of sample points on two shapes is not equal, the cost matrix can be made square by adding dummy nodes to the smaller point set.

3.3 Invariance and Robustness
A matching approach should be 1) invariant under scaling and translation, and 2) robust under small geometrical distortions, occlusion, and presence of outliers. In certain applications, one may want complete invariance under rotation, or perhaps even the full group of affine transformations. We now evaluate shape context matching by these criteria.

2. This choice corresponds to a linearly increasing positional uncertainty with distance from p_i, a reasonable result if the transformation between the shapes around p_i can be locally approximated as affine.
3. Alternatives include Bickel's generalization of the Kolmogorov-Smirnov test for 2D distributions [4], which does not require binning.
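The matching step of Sections 3.1-3.2 can be sketched end to end: χ² costs between normalized histograms, then square assignment with dummy-node padding. SciPy's `linear_sum_assignment` stands in here for the Hungarian-style solvers [42], [28] cited above, and `eps_d` plays the role of the dummy cost ε_d; this is a sketch, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def chi2_cost(h1, h2, eps=1e-10):
    """Chi-squared cost matrix between two sets of histograms.

    h1: (n, K), h2: (m, K). Rows are normalized to sum to 1 first;
    eps guards against empty bins in the denominator.
    """
    h1 = h1 / (h1.sum(axis=1, keepdims=True) + eps)
    h2 = h2 / (h2.sum(axis=1, keepdims=True) + eps)
    num = (h1[:, None, :] - h2[None, :, :]) ** 2
    den = h1[:, None, :] + h2[None, :, :] + eps
    return 0.5 * (num / den).sum(axis=2)

def match(cost, eps_d=0.25):
    """Square assignment with dummy nodes added to both point sets.

    Pads the (n, m) cost matrix to (n+m, n+m) with constant cost eps_d,
    so a real point is assigned to a dummy (cols[i] >= m) whenever no
    real match is available below eps_d. Returns (rows, cols).
    """
    n, m = cost.shape
    padded = np.full((n + m, n + m), eps_d)
    padded[:n, :m] = cost
    return linear_sum_assignment(padded)
```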
Invariance to translation is intrinsic to the shape context definition, since all measurements are taken with respect to points on the object. To achieve scale invariance, we normalize all radial distances by the mean distance between the n² point pairs in the shape.

Since shape contexts are extremely rich descriptors, they are inherently insensitive to small perturbations of parts of the shape. While we have no theoretical guarantees here, robustness to small nonlinear transformations, occlusions, and presence of outliers is evaluated experimentally in Section 4.2.

In the shape context framework, we can provide for complete rotation invariance, if this is desirable for an application. Instead of using the absolute frame for computing the shape context at each point, one can use a relative frame, based on treating the tangent vector at each point as the positive x-axis. In this way, the reference frame turns with the tangent angle, and the result is a completely rotation invariant descriptor. In Appendix A, we demonstrate this experimentally. It should be emphasized, though, that in many applications complete invariance impedes recognition performance; e.g., when distinguishing 6 from 9, rotation invariance would be completely inappropriate. Another drawback is that many points will not have well-defined or reliable tangents. Moreover, many local appearance features lose their discriminative power if they are not measured in the same coordinate system.

Additional robustness to outliers can be obtained by excluding the estimated outliers from the shape context computation. More specifically, consider a set of points that have been labeled as outliers on a given iteration. We render these points "invisible" by not allowing them to contribute to any histogram. However, we still assign them shape contexts, taking into account only the surrounding inlier points, so that at a later iteration they have a chance of reemerging as inliers.

3.4 Related Work
The most comprehensive body of work on shape correspondence in this general setting is the work of Gold et al. [19] and Chui and Rangarajan [9]. They developed an iterative optimization algorithm to determine point correspondences and underlying image transformations jointly, where typically some generic transformation class is assumed, e.g., affine or thin plate splines. The cost function that is being minimized is the sum of Euclidean distances between a point on the first shape and the transformed second shape. This sets up a chicken-and-egg problem: the distances make sense only when there is at least a rough alignment of shape. Joint estimation of correspondences and shape transformation leads to a difficult, highly nonconvex optimization problem, which is solved using deterministic annealing [19]. The shape context is a very discriminative point descriptor, facilitating easy and robust correspondence recovery by incorporating global shape information into a local descriptor.

As far as we are aware, the shape context descriptor and its use for matching 2D shapes is novel. The most closely related idea in past work is that due to Johnson and Hebert [26] in their work on range images. They introduced a representation for matching dense clouds of oriented 3D points called the "spin image." A spin image is a 2D histogram formed by spinning a plane around a normal vector on the surface of the object and counting the points that fall inside bins in the plane. As the size of this plane is relatively small, the resulting signature is not as informative as a shape context for purposes of recovering correspondences. This characteristic, however, might have the trade-off of additional robustness to occlusion. In another related work, Carlsson [8] has exploited the concept of order structure for characterizing local shape configurations. In this work, the relationships between points and tangent lines in a shape are used for recovering correspondences.

4 MODELING TRANSFORMATIONS
Given a finite set of correspondences between points on two shapes, one can proceed to estimate a plane transformation T : R² → R² that may be used to map arbitrary points from one shape to the other. This idea is illustrated by the warped gridlines in Fig. 2, wherein the specified correspondences consisted of a small number of landmark points, such as the centers of the eyes, the tips of the dorsal fins, etc., and T extends the correspondences to arbitrary points.

We need to choose T from a suitable family of transformations. A standard choice is the affine model, i.e.,

T(x) = Ax + o,   (3)

with some matrix A and a translational offset vector o parameterizing the set of all allowed transformations. Then, the least squares solution T̂ = (Â, ô) is obtained by

ô = (1/n) Σ_{i=1}^{n} (p_i − q_{π(i)}),   (4)

Â = (Q⁺P)ᵗ,   (5)

where P and Q contain the homogeneous coordinates of P and Q, respectively, i.e.,

P = [1 p_11 p_12; ... ; 1 p_n1 p_n2].   (6)

Here, Q⁺ denotes the pseudoinverse of Q.

In this work, we mostly use the thin plate spline (TPS) model [14], [37], which is commonly used for representing flexible coordinate transformations. Bookstein [6] found it to be highly effective for modeling changes in biological forms. Powell applied the TPS model to recover transformations between curves [44]. The thin plate spline is the 2D generalization of the cubic spline. In its regularized form, which is discussed below, the TPS model includes the affine model as a special case. We will now provide some background information on the TPS model.

We start with the 1D interpolation problem. Let v_i denote the target function values at corresponding locations p_i = (x_i, y_i) in the plane, with i = 1, 2, ..., n. In particular, we will set v_i equal to x′_i and y′_i in turn to obtain one continuous transformation for each coordinate. We assume that the locations (x_i, y_i) are all different and are not collinear. The TPS interpolant f(x, y) minimizes the bending energy
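The affine estimate of (3)-(5) can be sketched directly. Since the homogeneous 3 × 3 matrix Â already carries a translation column, returning ô separately, as below, is redundant but mirrors (4)-(5); this is an illustrative sketch under that reading:

```python
import numpy as np

def fit_affine(P_pts, Q_pts):
    """Least-squares affine fit from corresponding point sets.

    P_pts, Q_pts: (n, 2) arrays of corresponding points.
    Returns (A, o): A is the 3 x 3 homogeneous matrix (Q^+ P)^T of
    eq. (5); o is the mean offset of eq. (4).
    """
    P_pts = np.asarray(P_pts, float)
    Q_pts = np.asarray(Q_pts, float)
    o = (P_pts - Q_pts).mean(axis=0)                 # eq. (4)
    ones = np.ones((len(P_pts), 1))
    Ph = np.hstack([ones, P_pts])                    # homogeneous coords, eq. (6)
    Qh = np.hstack([ones, Q_pts])
    A = (np.linalg.pinv(Qh) @ Ph).T                  # eq. (5)
    return A, o
```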
I_f = ∬_{R²} [ (∂²f/∂x²)² + 2(∂²f/∂x∂y)² + (∂²f/∂y²)² ] dx dy

and has the form

f(x, y) = a_1 + a_x x + a_y y + Σ_{i=1}^{n} w_i U(‖(x_i, y_i) − (x, y)‖),

where the kernel function U(r) is defined by U(r) = r² log r², and U(0) = 0 as usual. In order for f(x, y) to have square integrable second derivatives, we require that

Σ_{i=1}^{n} w_i = 0 and Σ_{i=1}^{n} w_i x_i = Σ_{i=1}^{n} w_i y_i = 0.   (7)

Together with the interpolation conditions, f(x_i, y_i) = v_i, this yields a linear system for the TPS coefficients.

We use two separate TPS functions to model a coordinate transformation,

T(x, y) = (f_x(x, y), f_y(x, y)),   (11)

which yields a displacement field that maps any position in the first image to its interpolated location in the second image.

In many cases, the initial estimate of the correspondences contains some errors, which could degrade the quality of the transformation estimate. The steps of recovering correspondences and estimating transformations can be iterated to overcome this problem. We usually use a fixed number of iterations, typically three in large-scale experiments, but more refined schemes are possible. Experimental experience shows, however, that performance is largely independent of these details. An example of the iterative algorithm is illustrated in Fig. 4.
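The TPS fit implied by (7) and the interpolation conditions can be sketched as the standard (n+3) × (n+3) linear system (one solve per output coordinate). The regularization weight `lam` corresponds to the regularized form mentioned above; its placement on the kernel diagonal is the usual convention and an assumption here, not taken verbatim from the paper:

```python
import numpy as np

def U(r):
    """TPS kernel U(r) = r^2 log r^2, with U(0) = 0."""
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz] ** 2)
    return out

def fit_tps(pts, v, lam=0.0):
    """Solve for 1D TPS coefficients (call once per coordinate).

    pts: (n, 2) landmark locations; v: (n,) target values; lam = 0
    gives exact interpolation f(x_i, y_i) = v_i.
    Returns (w, a): kernel weights and affine part, so that
    f(x, y) = a[0] + a[1] x + a[2] y + sum_i w_i U(|pts_i - (x, y)|).
    """
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    K = U(d) + lam * np.eye(n)                 # regularized kernel block
    Pm = np.hstack([np.ones((n, 1)), pts])     # affine block
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = Pm
    L[n:, :n] = Pm.T                           # enforces side conditions (7)
    rhs = np.concatenate([v, np.zeros(3)])
    sol = np.linalg.solve(L, rhs)
    return sol[:n], sol[n:]

def tps_eval(pts, w, a, xy):
    """Evaluate the fitted TPS at query points xy: (m, 2)."""
    d = np.linalg.norm(xy[:, None, :] - pts[None, :, :], axis=2)
    return a[0] + xy @ a[1:] + U(d) @ w
```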
Fig. 4. Illustration of the matching process applied to the example of Fig. 1. Top row: 1st iteration. Bottom row: 5th iteration. Left column: estimated correspondences shown relative to the transformed model, with tangent vectors shown. Middle column: estimated correspondences shown relative to the untransformed model. Right column: result of transforming the model based on the current correspondences; this is the input to the next iteration. The grid points illustrate the interpolated transformation over R². Here, we have used a regularized TPS model with λ_o = 1.
examples rather than a set of formal logical rules. As an example, a sparrow is a likely prototype for the category of birds; a less likely choice might be a penguin. The idea of prototypes allows for soft category membership, meaning that as one moves farther away from the ideal example in some suitably defined similarity space, one's association with that prototype falls off. When one is sufficiently far away from that prototype, the distance becomes meaningless, but by then one is most likely near a different prototype. As an example, one can talk about good or so-so examples of the color red, but when the color becomes sufficiently different, the level of dissimilarity saturates at some maximum level rather than continuing on indefinitely.

Prototype-based recognition translates readily into the computational framework of nearest-neighbor methods using multiple stored views. Nearest-neighbor classifiers have the property [46] that, as the number of examples n in the training set goes to infinity, the 1-NN error converges to a value ≤ 2E*, where E* is the Bayes Risk (for K-NN, as K → ∞ and K/n → 0, the error → E*). This is interesting because it shows that the humble nearest-neighbor classifier is asymptotically optimal, a property not possessed by several considerably more complicated techniques. Of course, what matters in practice is the performance for small n, and this gives us a way to compare different similarity/distance measures.

5.1 Shape Distance
In this section, we make precise our definition of shape distance and apply it to several practical problems. We used a regularized TPS transformation model and three iterations of shape context matching and TPS reestimation. After matching, we estimated shape distances as the weighted sum of three terms: shape context distance, image appearance distance, and bending energy.

We measure the shape context distance between shapes P and Q as the symmetric sum of shape context matching costs over best matching points, i.e.,
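The equation itself is cut off here, but the description, a symmetric sum of best-match costs, admits a direct sketch given a precomputed point-to-point cost matrix. This is one plausible reading of the text, not the paper's verbatim formula:

```python
import numpy as np

def sc_distance(cost):
    """Symmetric shape context distance from an (n, m) cost matrix:
    the average best-match cost from P to Q plus the average
    best-match cost from Q to P."""
    return cost.min(axis=1).mean() + cost.min(axis=0).mean()
```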
Fig. 5. Testing data for empirical robustness evaluation, following Chui and Rangarajan [9]. The model point sets are shown in the first column. Columns 2-4 show examples of target point sets for the deformation, noise, and outlier tests, respectively.
Fig. 6. Comparison of our results to those of Chui and Rangarajan and to iterated closest point, for the fish and Chinese character, respectively. The error bars indicate the standard deviation of the error over 100 random trials. Here, we have used 5 iterations with λ_o = 1.0. In the deformation and noise tests, no dummy nodes were added. In the outlier test, dummy nodes were added to the model point set such that the total number of nodes was equal to that of the target. In this case, the value of ε_d does not affect the solution.
Fig. 7. Handwritten digit recognition on the MNIST data set. Left: Test set errors of a 1-NN classifier using SSD and Shape Distance (SD) measures. Right: Detail of performance curve for Shape Distance, including results with training set sizes of 15,000 and 20,000. Results are shown on a semilog-x scale for K = 1, 3, 5 nearest-neighbors.
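The K-NN evaluation used throughout (here for K = 1, 3, 5) reduces to a majority vote over a precomputed distance matrix; a minimal sketch:

```python
import numpy as np

def knn_classify(dist, train_labels, k=3):
    """Classify each test item from an (n_test, n_train) distance matrix
    by majority vote among its k nearest training neighbors."""
    idx = np.argsort(dist, axis=1)[:, :k]     # k nearest per test item
    votes = train_labels[idx]
    out = np.empty(len(dist), dtype=train_labels.dtype)
    for i, row in enumerate(votes):
        vals, counts = np.unique(row, return_counts=True)
        out[i] = vals[np.argmax(counts)]      # majority label
    return out
```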
1. group S₁ into classes given the class prototypes p_c, and
2. identify a representative prototype for each class given the elements in the cluster.

Basically, item 1 is solved by assigning each shape P ∈ S₁ to the nearest prototype, thus

c(P) = arg min_k D(P, p_k).   (16)

For given classes, in item 2, new prototypes are selected based on minimal mean dissimilarity, i.e.,

p_k = arg min_{p ∈ S₂} Σ_{P : c(P) = k} D(P, p).   (17)

Since both steps minimize the same cost function,

H(c, p) = Σ_{P ∈ S₁} D(P, p_{c(P)}),   (18)

the algorithm necessarily converges to a (local) minimum.

As with most clustering methods, with k-medoids one must have a strategy for choosing k. We select the number of prototypes using a greedy splitting strategy, starting with one prototype per category. We choose the cluster to split based on the associated overall misclassification error. This continues until the overall misclassification error has dropped below a criterion level. Thus, the prototypes are automatically allocated to the different object classes, optimally using the available resources. The application of this procedure to a set of views of 3D objects is explored in Section 6.2 and illustrated in Fig. 10.

6 CASE STUDIES
6.1 Digit Recognition
Here, we present results on the MNIST data set of handwritten digits, which consists of 60,000 training and 10,000 test digits [34]. In the experiments, we used 100 points sampled from the Canny edges to represent each digit. When computing the C_ij's for the bipartite matching, we included a term representing the dissimilarity of local tangent angles. Specifically, we defined the matching cost as C_ij = (1 − β)C_ij^sc + βC_ij^tan, where C_ij^sc is the shape context cost, C_ij^tan = 0.5(1 − cos(θ_i − θ_j)) measures tangent angle dissimilarity, and β = 0.1. For recognition, we used a K-NN classifier with a distance function

D = 1.6 D_ac + D_sc + 0.3 D_be.   (19)

The weights in (19) have been optimized by a leave-one-out procedure on a 3,000 × 3,000 subset of the training data.

On the MNIST data set, nearly 30 algorithms have been compared (http://www.research.att.com/~yann/exdb/mnist/index.html). The lowest test set error rate published at this time is 0.7 percent, for a boosted LeNet-4 with a training set of size 60,000 × 10 synthetic distortions per training digit. Our error rate using 20,000 training examples and 3-NN is 0.63 percent. The 63 errors are shown in Fig. 8.⁴

As mentioned earlier, what matters in practical applications of nearest-neighbor methods is the performance for small n, and this gives us a way to compare different similarity/distance measures. In Fig. 7 (left), our shape distance is compared to SSD (sum of squared differences between pixel brightness values). In Fig. 7 (right), we compare the classification rates for different K.

6.2 3D Object Recognition
Our next experiment involves the 20 common household objects from the COIL-20 database [40]. Each object was placed on a turntable and photographed every 5° for a total of 72 views per object. We prepared our training sets by selecting a number of equally spaced views for each object and using the remaining views for testing. The matching algorithm is exactly the same as for digits. Recall that the Canny edge detector responds both to external and internal contours, so the 100 sample points are not restricted to the external boundary of the silhouette.

Fig. 9 shows the performance using 1-NN with the distance function D as given in (19) compared to a

4. DeCoste and Schölkopf [13] report an error rate of 0.56 percent on the same database using Virtual Support Vectors (VSV) with the full training set of 60,000. VSVs are found as follows: 1) obtain SVs from the original training set using a standard SVM, 2) subject the SVs to a set of desired transformations (e.g., translation), 3) train another SVM on the generated examples.
Fig. 8. All of the misclassified MNIST test digits using our method (63 out of 10,000). The text above each digit indicates the example number
followed by the true label and the assigned label.
straightforward sum of squared differences (SSD). SSD performs very well on this easy database due to the lack of variation in lighting [24] (PCA just makes it faster).

The prototype selection algorithm is illustrated in Fig. 10. As seen, views are allocated mainly to the more complex categories with high within-class variability. The curve marked SC-proto in Fig. 9 shows the improved classification performance using this prototype selection strategy instead of equally spaced views. Note that we obtain a 2.4 percent error rate with an average of only four two-dimensional views for each three-dimensional object, thanks to the flexibility provided by the matching algorithm.

6.3 MPEG-7 Shape Silhouette Database
Our next experiment involves the MPEG-7 shape silhouette database, specifically Core Experiment CE-Shape-1 part B, which measures performance of similarity-based retrieval [25]. The database consists of 1,400 images: 70 shape categories, 20 images per category. The performance is measured using the so-called "bullseye test," in which each

Fig. 11. Examples of shapes in the MPEG-7 database for three different categories.
ACKNOWLEDGMENTS
This research is supported by the Army Research Office
(ARO) DAAH04-96-1-0341, the Digital Library Grant IRI-
Fig. 13. Kimia data set: each row shows instances of a different object 9411334, a US National Science Foundation Graduate
category. Performance is measured by the number of closest matches Fellowship for S. Belongie, and the German Research
with the correct category label. Note that several of the categories require
rotation invariant matching for effective recognition. All of the 1st ranked Foundation by DFG grant PU-165/1. Parts of this work
closest matches were correct using our method. Of the 2nd ranked have appeared in [3], [2]. The authors wish to thank H. Chui
matches, one error occurred in 1 versus 8. In the 3rd ranked matches, and A. Rangarajan for providing the synthetic testing data
confusions arose from 2 versus 8, 8 versus 1, and 15 versus 17.
used in Section 4.2. We would also like to thank them and
various members of the Berkeley computer vision group,
invariance to several common image transformations, in-
particularly A. Berg, A. Efros, D. Forsyth, T. Leung, J. Shi,
cluding significant 3D rotations of real-world objects.

APPENDIX A
COMPLETE ROTATION INVARIANT RECOGNITION

In this appendix, we demonstrate the use of the relative frame in our approach as a means of obtaining complete rotation invariance. To demonstrate this idea, we used the database provided by Sharvit et al. [53], shown in Fig. 13. In this experiment, we used n = 100 sample points and, as mentioned above, we used the relative frame (Section 3.3) when computing the shape contexts. We used five bins for log r over the range 0.125 to 2 and 12 equally spaced angular bins, in these and all other experiments in this paper. No transformation model at all was used. As a similarity score, we used the matching cost Σ_i C(p_i, q_π(i)) after one iteration with no transformation step. Thus, this experiment is specifically designed to evaluate the power of the shape descriptor in the face of rotation.

In [53] and [17], the authors summarize their results on this data set by stating the number of 1st, 2nd, and 3rd nearest neighbors that fall into the correct category. Our results are 25/25, 24/25, and 22/25. The results quoted in [53] and [17] are 23/25, 21/25, 20/25 and 25/25, 21/25, 19/25, respectively.

APPENDIX B
SAMPLING CONSIDERATIONS

In our approach, a shape is represented by a set of sample points drawn from the internal and external contours of an object. Operationally, one runs an edge detector on the

ACKNOWLEDGMENTS

and Y. Weiss, for useful discussions. This work was carried out while the authors were with the Electrical Engineering and Computer Science Division, University of California at Berkeley.

REFERENCES
[1] Y. Amit, D. Geman, and K. Wilder, "Joint Induction of Shape Features and Tree Classifiers," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 11, pp. 1300-1305, Nov. 1997.
[2] S. Belongie, J. Malik, and J. Puzicha, "Matching Shapes," Proc. Eighth Int'l Conf. Computer Vision, pp. 454-461, July 2001.
[3] S. Belongie, J. Malik, and J. Puzicha, "Shape Context: A New Descriptor for Shape Matching and Object Recognition," Advances in Neural Information Processing Systems 13: Proc. 2000 Conf., T.K. Leen, T.G. Dietterich, and V. Tresp, eds., pp. 831-837, 2001.
[4] P.J. Bickel, "A Distribution Free Version of the Smirnov Two-Sample Test in the Multivariate Case," Annals of Math. Statistics, vol. 40, pp. 1-23, 1969.
[5] F.L. Bookstein, "Principal Warps: Thin-Plate Splines and Decomposition of Deformations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 6, pp. 567-585, June 1989.
[6] F.L. Bookstein, Morphometric Tools for Landmark Data: Geometry and Biology. Cambridge Univ. Press, 1991.
[7] C. Burges and B. Schölkopf, "Improving the Accuracy and Speed of Support Vector Machines," Advances in Neural Information Processing Systems 9: Proc. 1996 Conf., D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, eds., pp. 375-381, 1997.
[8] S. Carlsson, "Order Structure, Correspondence and Shape Based Categories," Int'l Workshop Shape, Contour and Grouping, May 1999.
[9] H. Chui and A. Rangarajan, "A New Algorithm for Non-Rigid Point Matching," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 44-51, June 2000.
[10] T. Cootes, D. Cooper, C. Taylor, and J. Graham, "Active Shape Models - Their Training and Application," Computer Vision and Image Understanding (CVIU), vol. 61, no. 1, pp. 38-59, Jan. 1995.
[11] C. Cortes and V. Vapnik, "Support Vector Networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[12] Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. B.V. Dasarathy, ed., IEEE Computer Soc., 1991.
[13] D. DeCoste and B. Schölkopf, "Training Invariant Support Vector Machines," Machine Learning, to appear in 2002.
[14] J. Duchon, "Splines Minimizing Rotation-Invariant Semi-Norms in Sobolev Spaces," Constructive Theory of Functions of Several Variables, W. Schempp and K. Zeller, eds., pp. 85-100, Berlin: Springer-Verlag, 1977.
[15] M. Fischler and R. Elschlager, "The Representation and Matching of Pictorial Structures," IEEE Trans. Computers, vol. 22, no. 1, pp. 67-92, 1973.
[16] D. Gavrila and V. Philomin, "Real-Time Object Detection for Smart Vehicles," Proc. Seventh Int'l Conf. Computer Vision, pp. 87-93, 1999.
[17] Y. Gdalyahu and D. Weinshall, "Flexible Syntactic Matching of Curves and Its Application to Automatic Hierarchical Classification of Silhouettes," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 12, pp. 1312-1328, Dec. 1999.
[18] F. Girosi, M. Jones, and T. Poggio, "Regularization Theory and Neural Networks Architectures," Neural Computation, vol. 7, no. 2, pp. 219-269, 1995.
[19] S. Gold, A. Rangarajan, C.-P. Lu, S. Pappu, and E. Mjolsness, "New Algorithms for 2D and 3D Point Matching: Pose Estimation and Correspondence," Pattern Recognition, vol. 31, no. 8, 1998.
[20] E. Goldmeier, "Similarity in Visually Perceived Forms," Psychological Issues, vol. 8, no. 1, pp. 1-135, 1936/1972.
[21] U. Grenander, Y. Chow, and D. Keenan, HANDS: A Pattern Theoretic Study of Biological Shapes. Springer, 1991.
[22] M. Hagedoorn, "Pattern Matching Using Similarity Measures," PhD thesis, Universiteit Utrecht, 2000.
[23] D. Huttenlocher, G. Klanderman, and W. Rucklidge, "Comparing Images Using the Hausdorff Distance," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 850-863, Sept. 1993.
[24] D. Huttenlocher, R. Lilien, and C. Olson, "View-Based Recognition Using an Eigenspace Approximation to the Hausdorff Measure," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp. 951-955, Sept. 1999.
[25] S. Jeannin and M. Bober, "Description of Core Experiments for MPEG-7 Motion/Shape," Technical Report ISO/IEC JTC 1/SC 29/WG 11 MPEG99/N2690, MPEG-7, Seoul, Mar. 1999.
[26] A.E. Johnson and M. Hebert, "Recognizing Objects by Matching Oriented Points," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 684-689, 1997.
[27] D. Jones and J. Malik, "Computational Framework for Determining Stereo Correspondence from a Set of Linear Spatial Filters," Image and Vision Computing, vol. 10, no. 10, pp. 699-708, Dec. 1992.
[28] R. Jonker and A. Volgenant, "A Shortest Augmenting Path Algorithm for Dense and Sparse Linear Assignment Problems," Computing, vol. 38, pp. 325-340, 1987.
[29] D. Kendall, "Shape Manifolds, Procrustean Metrics and Complex Projective Spaces," Bull. London Math. Soc., vol. 16, pp. 81-121, 1984.
[30] J.J. Koenderink and A.J. van Doorn, "The Internal Representation of Solid Shape with Respect to Vision," Biological Cybernetics, vol. 32, pp. 211-216, 1979.
[31] M. Lades, C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. Wurtz, and W. Konen, "Distortion Invariant Object Recognition in the Dynamic Link Architecture," IEEE Trans. Computers, vol. 42, no. 3, pp. 300-311, Mar. 1993.
[32] Y. Lamdan, J. Schwartz, and H. Wolfson, "Affine Invariant Model-Based Object Recognition," IEEE Trans. Robotics and Automation, vol. 6, pp. 578-589, 1990.
[33] L.J. Latecki, R. Lakämper, and U. Eckhardt, "Shape Descriptors for Non-Rigid Shapes with a Single Closed Contour," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 424-429, 2000.
[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[35] T.K. Leung, M.C. Burl, and P. Perona, "Finding Faces in Cluttered Scenes Using Random Labelled Graph Matching," Proc. Fifth Int'l Conf. Computer Vision, pp. 637-644, 1995.
[36] D.G. Lowe, "Object Recognition from Local Scale-Invariant Features," Proc. Seventh Int'l Conf. Computer Vision, pp. 1150-1157, Sept. 1999.
[37] J. Meinguet, "Multivariate Interpolation at Arbitrary Points Made Simple," J. Applied Math. Physics (ZAMP), vol. 5, pp. 439-468, 1979.
[38] B. Moghaddam, T. Jebara, and A. Pentland, "Bayesian Face Recognition," Pattern Recognition, vol. 33, no. 11, pp. 1771-1782, Nov. 2000.
[39] F. Mokhtarian, S. Abbasi, and J. Kittler, "Efficient and Robust Retrieval by Shape Content Through Curvature Scale Space," Image Databases and Multi-Media Search, A.W.M. Smeulders and R. Jain, eds., pp. 51-58, World Scientific, 1997.
[40] H. Murase and S. Nayar, "Visual Learning and Recognition of 3-D Objects from Appearance," Int'l J. Computer Vision, vol. 14, no. 1, pp. 5-24, Jan. 1995.
[41] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, "Pedestrian Detection Using Wavelet Templates," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 193-199, June 1997.
[42] C. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, 1982.
[43] E. Persoon and K. Fu, "Shape Discrimination Using Fourier Descriptors," IEEE Trans. Systems, Man, and Cybernetics, vol. 7, no. 3, pp. 170-179, Mar. 1977.
[44] M.J.D. Powell, "A Thin Plate Spline Method for Mapping Curves into Curves in Two Dimensions," Computational Techniques and Applications (CTAC '95), 1995.
[45] B.D. Ripley, "Modelling Spatial Patterns," J. Royal Statistical Society, Series B, vol. 39, pp. 172-212, 1977.
[46] B.D. Ripley, Pattern Recognition and Neural Networks. Cambridge Univ. Press, 1996.
[47] E. Rosch, "Natural Categories," Cognitive Psychology, vol. 4, no. 3, pp. 328-350, 1973.
[48] E. Rosch, C.B. Mervis, W.D. Gray, D.M. Johnson, and P. Boyes-Braem, "Basic Objects in Natural Categories," Cognitive Psychology, vol. 8, no. 3, pp. 382-439, 1976.
[49] C. Schmid and R. Mohr, "Local Grayvalue Invariants for Image Retrieval," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 5, pp. 530-535, May 1997.
[50] S. Sclaroff and A. Pentland, "Modal Matching for Correspondence and Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 6, pp. 545-561, June 1995.
[51] G. Scott and H. Longuet-Higgins, "An Algorithm for Associating the Features of Two Images," Proc. Royal Soc. London, vol. 244, pp. 21-26, 1991.
[52] L.S. Shapiro and J.M. Brady, "Feature-Based Correspondence: An Eigenvector Approach," Image and Vision Computing, vol. 10, no. 5, pp. 283-288, June 1992.
[53] D. Sharvit, J. Chan, H. Tek, and B. Kimia, "Symmetry-Based Indexing of Image Databases," J. Visual Comm. and Image Representation, vol. 9, no. 4, pp. 366-380, Dec. 1998.
[54] L. Sirovich and M. Kirby, "Low Dimensional Procedure for the Characterization of Human Faces," J. Optical Soc. Am. A, vol. 4, no. 3, pp. 519-524, 1987.
[55] D.W. Thompson, On Growth and Form. Cambridge Univ. Press, 1917.
[56] M. Turk and A. Pentland, "Eigenfaces for Recognition," J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71-96, 1991.
[57] S. Umeyama, "An Eigendecomposition Approach to Weighted Graph Matching Problems," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 10, no. 5, pp. 695-703, Sept. 1988.
[58] R.C. Veltkamp and M. Hagedoorn, "State of the Art in Shape Matching," Technical Report UU-CS-1999-27, Utrecht Univ., 1999.
[59] T. Vetter, M.J. Jones, and T. Poggio, "A Bootstrapping Algorithm for Learning Linear Models of Object Classes," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 40-46, 1997.
[60] G. Wahba, Spline Models for Observational Data. Soc. Industrial and Applied Math., 1990.
[61] A. Yuille, "Deformable Templates for Face Recognition," J. Cognitive Neuroscience, vol. 3, no. 1, pp. 59-71, 1991.
[62] C. Zahn and R. Roskies, "Fourier Descriptors for Plane Closed Curves," IEEE Trans. Computers, vol. 21, no. 3, pp. 269-281, Mar. 1972.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 4, APRIL 2002
Serge Belongie received the BS degree (with honor) in electrical engineering from the California Institute of Technology, Pasadena, California, in 1995, and the MS and PhD degrees in electrical engineering and computer sciences (EECS) at the University of California at Berkeley, in 1997 and 2000, respectively. While at Berkeley, his research was supported by a National Science Foundation Graduate Research Fellowship and the Chancellor's Opportunity Predoctoral Fellowship. He is also a cofounder of Digital Persona, Inc., and the principal architect of the Digital Persona fingerprint recognition algorithm. He is currently an assistant professor in the Computer Science and Engineering Department at the University of California at San Diego. His research interests include computer vision, pattern recognition, and digital signal processing. He is a member of the IEEE.

Jan Puzicha received the Diploma degree in 1995 and the PhD degree in computer science in 1999, both from the University of Bonn, Bonn, Germany. He was with the Computer Vision and Pattern Recognition Group, University of Bonn, from 1995 to 1999. In September 1999, he joined the Computer Science Department, University of California, Berkeley, as an Emmy Noether Fellow of the German Science Foundation, where he is currently working on optimization methods for perceptual grouping and image segmentation. His research interests include computer vision, image processing, unsupervised learning, data analysis, and data mining.
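The rotation-invariant matching experiment of Appendix A can be sketched in code. The following Python sketch is an illustrative reconstruction, not the authors' implementation: the tangent estimate (the chord direction to the next contour sample), the small angular offset used to stabilize bin-edge cases, and the greedy one-to-one matching (a cheap stand-in for the optimal Jonker-Volgenant assignment solver [28] used in the paper) are all simplifying assumptions.

```python
import numpy as np

def shape_contexts(points, n_r=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    """Log-polar shape context histograms (5 log-r bins x 12 angle bins)
    in the rotation-invariant relative frame: angles are measured
    relative to a local tangent, crudely estimated here as the chord
    direction to the next contour sample (an assumption)."""
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]   # diff[i, j] = p_j - p_i
    dist = np.hypot(diff[..., 0], diff[..., 1])
    dist /= dist[dist > 0].mean()                    # normalize scale by mean distance
    ang = np.arctan2(diff[..., 1], diff[..., 0])
    chord = np.roll(points, -1, axis=0) - points     # crude tangent estimate
    tangent = np.arctan2(chord[:, 1], chord[:, 0])
    # Small offset keeps angles that fall exactly on a bin edge
    # (e.g., the angle to the next sample) in a stable bin.
    ang = (ang - tangent[:, None] + 1e-7) % (2 * np.pi)
    r_edges = np.logspace(np.log10(r_inner), np.log10(r_outer), n_r + 1)
    hists = np.zeros((n, n_r, n_theta))
    for i in range(n):
        ok = (np.arange(n) != i) & (dist[i] >= r_edges[0]) & (dist[i] < r_edges[-1])
        r_bin = np.searchsorted(r_edges, dist[i, ok], side='right') - 1
        t_bin = (ang[i, ok] / (2 * np.pi) * n_theta).astype(int) % n_theta
        np.add.at(hists[i], (r_bin, t_bin), 1)       # accumulate counts per bin
    return hists.reshape(n, -1)

def chi2_cost(g, h, eps=1e-10):
    """Chi-square distance between every pair of (normalized) histograms."""
    g = g / (g.sum(axis=1, keepdims=True) + eps)
    h = h / (h.sum(axis=1, keepdims=True) + eps)
    num = (g[:, None, :] - h[None, :, :]) ** 2
    return 0.5 * (num / (g[:, None, :] + h[None, :, :] + eps)).sum(axis=-1)

def similarity(points_a, points_b):
    """Sum of matching costs C(p_i, q_pi(i)) under a greedy one-to-one
    assignment, with no transformation step, as in the Appendix A
    experiment. Greedy matching overestimates the optimal cost."""
    cost = chi2_cost(shape_contexts(points_a), shape_contexts(points_b))
    c, total = cost.copy(), 0.0
    for _ in range(min(c.shape)):
        i, j = np.unravel_index(np.argmin(c), c.shape)
        total += c[i, j]
        c[i, :] = np.inf                             # remove matched row/column
        c[:, j] = np.inf
    return total
```

Because each histogram is binned relative to the point's own tangent direction, rotating a shape leaves its descriptors (and hence the matching cost) essentially unchanged, which is exactly the property the Appendix A experiment isolates.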