Object Detection Using Contour Segment Networks
1 Introduction
We aim at detecting and localizing objects in real, cluttered images, given a single hand-drawn example as a model of their shape. This example depicts the contour outlines of an instance of the object class to be detected (e.g. bottles, figure 1d; or mugs, composed of two outlines as in figure 5a).
The task presents several challenges. The image edges are not reliably extracted
from complex images of natural scenes. The contour of the desired object is typically
fragmented over several pieces, and sometimes parts are missing. Moreover, locally,
edges lack specificity, and can be recognized only when put in the wider context of
the whole shape [2]. In addition, the object often appears in cluttered images. Clutter,
combined with the need for a ‘global view’ of the shape, is the principal source of
difficulty. Finally, the object shape in the test image can differ considerably from the
one of the example, because of variations among instances within an object class (class
variability).
In this paper, we present a new approach to shape matching which addresses all
these issues, and is especially suited to detect objects in substantially cluttered im-
ages. We start by linking the image edges at their discontinuities, and partitioning them
into roughly straight contour segments (section 3). These segments are then connected
along the edges and across their links, to form the image representation at the core of
our method: the Contour Segment Network (section 4). By recording the segment interconnections, the network captures the underlying image structure, and lets us cast object detection as finding paths through the network that resemble the model outlines.
We propose a computationally efficient matching algorithm for this purpose (section 5).
The resulting, possibly partial, paths are combined into final detection hypotheses by a
dedicated integration stage (section 6).
Operating on the Contour Segment Network brings two key advantages. First, even when most of the image is covered by clutter segments, only a limited number are connected to a path corresponding to a model outline. As we detail in section 5, this greatly limits the choices the matcher has to make, allowing it to correctly locate objects even in heavily cluttered images. It also makes the computational complexity linear in the number of test image segments, making our system particularly efficient.
Second, since the network connects segments also over edge discontinuities, the system
is robust to interruptions along the object contours, and to short missing parts.
Our method accommodates considerable class variability by a flexible measure of
the similarity between configurations of segments, which focuses on their overall spa-
tial arrangement. This measure first guides the matching process towards network paths
similar to the model outlines, and is then used to evaluate the quality of the produced
paths and to integrate them into final detections. Among its other important features, our approach can find multiple object instances in the same image, produces point correspondences, and handles large scale changes.
In section 7 we report results on detecting five diverse object classes over hundreds
of test images. Many of them are severely cluttered, in that the object contours form a
small minority of all image edges, and they comprise only a fraction of the image. Our
results compare favorably against a baseline Chamfer Matcher.
2 Previous work
Fig. 1. (a-c) Example links between edgel-chains. (a) Endpoint-to-endpoint link. (b) Tangent-
continuous T-junction link. (c) Tangent-discontinuous link. (d) 8 segments on a bottle-shaped
edgel-chain. (e) A segment (marked with an arc) bridging over link b).
3 Early processing
Detecting and linking edgel-chains. Edgels are detected by the excellent Berkeley
natural boundary detector [15], which was recently successfully applied to object recog-
nition [3]. Next, edgels are chained and a smoothing spline curve is fit to each edgel-
chain, providing estimates of the edgels’ tangent orientations.
Fig. 2. The six rules to build the Contour Segment Network. They connect (arrows) regular seg-
ments and bridging segments (marked with an arc). Rules 2-6 connect segments over different
edgel-chains ci .
Due to the well-known brittleness of edge detection, a contour is often broken into
several edgel-chains. Besides, the ideal contour might have branchings, which are not
captured by simple edgel-chaining. We counter these issues by linking edgel-chains: an
edgel-chain c1 is linked to an edgel-chain c2 if any edgel of c2 lies within a search area
near an endpoint of c1 (figure 1). The search area is an isosceles trapezium. The minor base rests on the endpoint of c1, and is perpendicular to the curve's tangent orientation, while the height points away from c1. This criterion links c1 to edgel-chains lying in front of one of its endpoints, thereby indicating that it could continue over c2. The trapezium shape expresses that the uncertainty about the location of c1's continuation grows with the distance from the breakpoint. Note how c1 can link either to an endpoint of c2, or to an interior edgel. The latter allows us to deal properly with T-junctions, as it records that the curve could continue in two directions (figure 1b). Besides, the end of c1 need not be oriented like the portion of c2 it links to (as in figure 1b); tangent-discontinuous links are also possible (figure 1c).
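The link criterion can be sketched as a point-in-trapezium test. In this minimal sketch the trapezium dimensions (`base_minor`, `base_major`, `height`) are hypothetical defaults, not values from the paper:

```python
def in_search_trapezium(p, d, q, base_minor=4.0, base_major=12.0, height=10.0):
    """Return True if point q lies in the isosceles trapezium anchored at the
    endpoint p of an edgel-chain, opening along the unit direction d (pointing
    away from the chain). The minor base rests on p, perpendicular to the
    tangent; the allowed width grows with the distance from p, modelling the
    growing uncertainty about the contour's continuation."""
    vx, vy = q[0] - p[0], q[1] - p[1]
    # Longitudinal coordinate along the tangent direction.
    longi = vx * d[0] + vy * d[1]
    if not 0.0 <= longi <= height:
        return False
    # Lateral distance, perpendicular to d.
    lat = abs(-vx * d[1] + vy * d[0])
    # Half-width interpolates linearly between the minor and major bases.
    half_w = 0.5 * (base_minor + (base_major - base_minor) * longi / height)
    return lat <= half_w
```

With this test, c1 would be linked to any edgel-chain c2 having at least one edgel inside the area.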
The edgel-chain links are the backbone structure on which the Contour Segment
Network will be built (section 4).
Contour segments. The elements composing the network are contour segments. These
are obtained by partitioning each edgel-chain into roughly straight segments. Figure 1d
shows the segmentation for a bottle-shaped edgel-chain. In addition to these regular
segments, we also construct segments bridging over tangent-continuous links between edgel-chains. The idea is to bridge the breaks in the edges, thus recovering useful segments that the breaks would otherwise cause to be missed.
Fig. 3. Network connectedness. All black segments are connected to S, up to depth 8. They include
a path around the bottle (thick).
5 Basic matching
By processing the test image as described before, we obtain its Contour Segment Net-
work. We also segment the contour chains of the model, giving a set of contour segment
chains along the outlines of the object.
The detection problem can now be formulated as finding paths through the net-
work which resemble the model chains. Let’s first consider a subproblem, termed basic
matching: find the path most resembling a model chain, starting from a basis match between a model segment and a test image segment. However, we do not know a priori where to start, as the test image is usually covered by a large majority of clutter segments. Therefore, we apply the basic matching algorithm described in this section, starting from all pairs of model and test segments with roughly similar orientations. The
resulting paths are then inspected and integrated into full detection hypotheses in the
next section.
We consider the object transformation from the model to the test image to be com-
posed of a global pose change, plus shape variations due to class variability. The pose
change is modeled by a translation t and a scale change σ, while class variability is
accommodated by a flexible measure of the similarity between configurations of seg-
ments.
The basic matching algorithm. The algorithm starts with a basis match between a
model segment bm and a test segment bt , and then iteratively matches the other model
segments, thereby tracing out a path in the network. The matched path P initially only
contains {bm , bt }.
1. Compute the scale change σ of the basis match.
2. Move to the next model segment m. Points 3-6 will match it to a test segment.
3. Define a set C of candidate test segments. These are all successors of the current test segment in the network, i.e. all segments connected at its free endpoint (the one opposite the endpoint connecting to P), and their successors (figure 4a). Including successors at depth 2 brings robustness against spurious test segments which might lie along the desired path.
4. Evaluate the candidates. Each candidate is evaluated according to its orientation
similarity to m, how well it fits in the path P constructed so far, and how strong its
edgels are (more details below).
5. Extend the path. The best candidate cbest is matched to m and {m, cbest } is added
to P.
6. Update σ. Re-estimate the scale change over P (more details below).
7. Iterate. The algorithm iterates to point 2, until the end of the model segment chain,
or until the path comes to a dead end (C = ∅). At this point, the algorithm restarts
from the basis match, proceeding in the backward direction, so as to match the
model segments lying before the basis one.
For simplicity, the algorithm is presented above as greedy. In our actual implementa-
tion, we retain the best two candidates, and then evaluate their possible successors. The
candidate with the best sum of its own score and the score of the best successor wins.
As the algorithm looks one step ahead before making a choice, it can find better paths.
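The forward pass of this loop can be sketched as follows; `successors` and `cost` are assumed helpers (the latter standing in for cost function (1)), and for brevity this version is the plain greedy variant, without the one-step lookahead or the backward pass:

```python
def extend_path(basis_model, basis_test, model_chain, successors, cost, q_th=0.3):
    """Sketch of the forward pass of the basic matching loop. `successors(s)`
    returns the network successors of a test segment; `cost(m, c, path)` plays
    the role of cost function (1). The real system also re-estimates the scale
    change sigma after every extension and runs a backward pass."""
    path = [(basis_model, basis_test)]
    current = basis_test
    for m in model_chain:  # model segments following the basis segment
        # Candidates: successors of the current segment and their successors
        # (depth 2), for robustness against spurious segments on the path.
        cands = set(successors(current))
        for s in list(cands):
            cands |= set(successors(s))
        # Discard implausible candidates (loose threshold q_th).
        plausible = [c for c in cands if cost(m, c, path) <= q_th]
        if not plausible:
            break  # dead end: the path stays partial
        # Greedy choice; the paper additionally keeps the two best candidates
        # and scores their successors before committing.
        best = min(plausible, key=lambda c: cost(m, c, path))
        path.append((m, best))
        current = best
    return path
```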
Evaluate the candidates. Each candidate test segment c ∈ C is evaluated by the following cost function

$$q(m, c, P) = w_{la} \cdot D_{la}(m, c, P) + w_{ld} \cdot D_{ld}(m, c, P) + w_\theta \cdot D_\theta(m, c) \quad (1)$$

The last term Dθ(m, c) ∈ [0, 1] measures the difference in orientation between m and c, normalized by π.
The other terms consider the location of c in the context of test segments matched
so far, and compare it to the location of m within the matched model segments. The
first such spatial relation is
1 X
−→i , −
→
Dla (m, c, P) = Dθ (−
mm cti )
|P|
{mi ,ti }∈P
the average difference in direction between vectors −−→i going from m’s center to the
mm
−→
centers of matched model segments mi , and corresponding vectors cti going from c
to the matched test segments ti (see figure 4d). The second relation is analogous, but
focuses on the distances between segments
$$D_{ld}(m, c, P) = \frac{1}{\sigma \, d_m \, |P|} \sum_{\{m_i, t_i\} \in P} \left|\, \sigma \left\| \overrightarrow{m m_i} \right\| - \left\| \overrightarrow{c t_i} \right\| \,\right|$$
where dm is the diagonal of the model’s bounding-box, and hence σdm is a normal-
ization factor adapted to the current scale change estimate σ. Thus, all three terms of
function (1) are scale invariant.
The proposed cost function grows smoothly as the model transformation departs
from a pure pose change. In particular the Dla term captures the structure of the spatial
arrangements, while still allowing for considerable shape variation. Function (1) is low
when c is located and oriented in a similar way as m, in the context of the rest of the
shape matched so far. Hence, it guides the algorithm towards a path of test segments
with an overall shape similar to the model.
In all experiments, the weights are wla = 0.7, wld = 0.15, wθ = 1 − wla − wld = 0.15.
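The candidate cost can be sketched as follows, using the weights above. The segment representation (dicts with a center and an orientation) and the handling of orientations as full directions are simplifying assumptions, not the paper's implementation:

```python
import math

def q_cost(m, c, path, sigma, d_m, w_la=0.7, w_ld=0.15, w_theta=0.15):
    """Sketch of cost function (1). Segments are dicts with 'center' (x, y)
    and 'theta' (radians); `path` is the list of (model_segment, test_segment)
    matches so far; d_m is the diagonal of the model's bounding-box."""
    def d_theta(a1, a2):
        d = abs(a1 - a2) % (2 * math.pi)
        return min(d, 2 * math.pi - d) / math.pi  # normalized to [0, 1]

    d_la = d_ld = 0.0
    for mi, ti in path:
        # Vector from m's center to a matched model segment's center, and the
        # corresponding vector on the test side.
        vm = (mi['center'][0] - m['center'][0], mi['center'][1] - m['center'][1])
        vt = (ti['center'][0] - c['center'][0], ti['center'][1] - c['center'][1])
        d_la += d_theta(math.atan2(vm[1], vm[0]), math.atan2(vt[1], vt[0]))
        d_ld += abs(sigma * math.hypot(*vm) - math.hypot(*vt))
    n = len(path)
    d_la /= n
    d_ld /= sigma * d_m * n  # normalization adapted to the current scale
    return w_la * d_la + w_ld * d_ld + w_theta * d_theta(m['theta'], c['theta'])
```

For a candidate located and oriented exactly as the model predicts, all three terms vanish and the cost is 0.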
Fig. 4. Basic matching. (a) Iteration 1: basis segment bt , candidates C with qc ≤ 0.3 (black
thin), and best candidate cbest (thick). (b) Matched path P after iteration 4. (c) Model, with basis
segment bm and segments matched at iteration 1-4 labeled. (d) Example vectors used in Dla , Dld .
Analyzing the values of qc over many test cases reveals that for most correct can-
didates qc < 0.15. In order to prevent the algorithm from deviating over a grossly
incorrect path when no plausible candidate is available, we discard all candidates with
qc above the loose threshold qth = 0.3. Hence: C ← {c|qc ≤ qth }.
In addition to the geometric quality qc of a retained candidate c, we also consider
its relevance, in terms of the average strength of its edgels ▽c ∈ [0, 1]. Hence, we set
the overall cost of c to qc · (1 − ▽c ). Experiments show a marked improvement over
treating edgels as binary features, when consistently exploiting edge strength here and
in the path evaluation score (next section).
Update σ. After extending P, the scale change σ is re-estimated as follows. Let δm be the average distance between pairs of edgels along the model segments, and δt be the corresponding distance for the test segments. Then, set σ = δt/δm. This estimation
considers the relative locations of the segments, together with their individual transfor-
mations, and is robust to mismatched segments within a correct path (unlike simpler
measures such as deriving σ from the bounding-box areas). Thanks to this step, σ is
continuously adapted to the growing path of segments, which is useful for computing
Dld when matching segments distant from the basis match. Due to shape variability and
detection inaccuracies, the scale change induced by a single segment holds only locally.
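The re-estimation rule σ = δt/δm can be sketched as follows; the edgel lists are assumed inputs gathered from the matched segments:

```python
import itertools
import math

def estimate_scale(model_edgels, test_edgels):
    """Sketch of the scale re-estimation step: sigma = delta_t / delta_m,
    where each delta is the average pairwise distance between edgels along
    the matched segments. Every edgel pair contributes, which makes the
    estimate robust to a few mismatched segments, unlike a bounding-box
    based measure."""
    def avg_pair_dist(pts):
        pairs = list(itertools.combinations(pts, 2))
        return sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return avg_pair_dist(test_edgels) / avg_pair_dist(model_edgels)

# For a test image scaled uniformly by 2 relative to the model:
model = [(0, 0), (1, 0), (0, 1)]
test = [(0, 0), (2, 0), (0, 2)]
print(estimate_scale(model, test))  # close to 2.0
```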
Properties. The basic matching algorithm has several attractive properties, due to operating on the Contour Segment Network. First and foremost, at every iteration it must choose among only a few candidates (about 4 on average), because only segments connected to the previous one are considered. Since it meets only a few distractors, it is likely to make the right choices and thus find the object even in substantially cluttered images.
The systematic exploitation of connectedness is the key driving force of our system. It
keeps the average number of candidates D low, and independent of the total number of
test segments T . As another consequence, the computational complexity for processing
all basis matches is O(T M D log2 (M )), with M the number of model segments. In
contrast to “local search” [4] and “interpretation trees” [11], this is linear in T , mak-
ing it possible to process images with a very large number of clutter segments (even
thousands). Second, the spatial relations used in Dla , Dld can easily be pre-computed
for all possible segment pairs. During basic matching, evaluating a candidate takes but
a few operations, making the whole algorithm computationally efficient. In our Matlab
implementation, it takes only 10 seconds on average to process the approximately 1000
basis matches occurring when matching a model to a typical test image. Third, thanks
to the careful construction of the network, there is no need for the object contour to be
fully or cleanly detected. Instead, it can be interrupted at several points, short parts can
be missing, and it can be intertwined with clutter contours.
6 Hypothesis integration
Basic matching produces a large set H = {Pi } of matched paths Pi , termed hypothe-
ses. Since there are several correct basis matches to start from along the object contour,
there are typically several correct hypotheses on an object instance (figure 5b+c+d). In
this section we group hypotheses likely to belong to the same object instance, and fuse
them in a single integrated hypothesis. This brings two important advantages. First,
hypotheses matching different parts of the same model contour chain, are combined
into a single, more complete contour. The same holds for hypotheses covering different
model chains, which would otherwise remain disjoint (figure 5d). Second, the presence
of (partially) repeated hypotheses is a valuable indication of their correctness (i.e. that
they cover an object instance and not clutter). Since the basic matcher prefers the correct path over others, it produces similar hypotheses when starting from different points along a correct path (figure 5b+c). Clutter paths, instead, grow much more randomly.
Hence, hypothesis integration can accumulate the evidence brought by overlapping hy-
potheses, thereby separating them better from clutter.
Before proceeding with the hypothesis integration stage, we evaluate the quality of
each hypothesis P ∈ H. Each segment match {m, t} ∈ P is evaluated with respect to
the others using function (1): q (m, t, P\{m, t}). Whereas during basic matching only
segments matched before were available as reference, here we evaluate {m, t} in the
context of the entire path. The score of {m, t} is now naturally defined by taking the maximum value qth of q as a roof: qth − q(m, t, P\{m, t}). Finally, the total score of P is the sum of the component matches' scores, weighed by their relevance (edgel strength ▽):
$$\phi(P) = \frac{1}{q_{th}} \sum_{\{m, t\} \in P} \triangledown_t \cdot \left( q_{th} - q\,(m, t, P \setminus \{m, t\}) \right)$$
The normalization by 1/qth makes φ range in [0, |P|]. In order to reduce noise and speed up further processing, we discard obvious garbage hypotheses, scoring below a low threshold φth = 1.5: H ← {P | φ(P) ≥ φth}.
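The path score can be sketched as follows; `q_func` and the `strength` map are assumed helpers standing in for function (1) and the edgel strengths:

```python
def path_score(path, q_func, strength, q_th=0.3):
    """Sketch of the path evaluation score phi(P). Each segment match {m, t}
    is re-scored by `q_func(m, t, rest)` against the rest of the path, capped
    at q_th, weighted by the edgel strength of its test segment (in [0, 1]),
    and the total is normalized by q_th so that phi ranges in [0, |P|]."""
    total = 0.0
    for m, t in path:
        rest = [pair for pair in path if pair != (m, t)]
        total += strength[t] * (q_th - q_func(m, t, rest))
    return total / q_th
```

A path whose matches all score perfectly (q = 0) with full-strength edgels reaches the maximum value |P|.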
Hypothesis integration consists of the following two phases:
Grouping phase.
1. Let A be a graph whose nodes are the hypotheses H, and whose arcs (Pi, Pj) are weighed by the (in-)compatibility csim between the pose transformations of Pi and Pj: csim(Pi, Pj) = ½ (c(Pi, Pj) + c(Pj, Pi)), with

$$c(P_i, P_j) = \frac{|t_i - t_j|}{d_m \, \sigma_i} \cdot \max\!\left(\frac{\sigma_i}{\sigma_j}, \frac{\sigma_j}{\sigma_i}\right)$$
Fig. 5. Hypothesis integration. (a) Mug model, composed of an outer and an inner chain (hole). (b-d) 3 out of 14 hypotheses in a group. (b) and (c) are very similar, and arise from two different basis matches along the outer model chain; (d) instead covers the mug's hole. (e) All 14 hypotheses are fused into a complete integrated hypothesis. Thanks to evidence accumulation, its score (28.6) is much higher than that of individual hypotheses ((b) scores 2.8). Note the important variations of the mug's shape w.r.t. the model.
The first factor measures the translation mismatch, normalized by the scale change
σ, while the second factor accounts for the scale mismatch.
2. Partition A using the Clique Partitioning algorithm proposed by [9]. Each resulting
group contains hypotheses with similar pose transformations. The crux is that a
group contains either hypotheses likely to belong to the same object instance, or
some clutter hypotheses. Mixed groups are rare.
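As an illustration, the pairwise compatibility weighing the arcs of A can be sketched as follows, reducing each hypothesis to its pose (t, σ); this reduction is a simplifying assumption of the sketch:

```python
import math

def c_sim(hyp_i, hyp_j, d_m):
    """Sketch of the symmetric pose (in-)compatibility between two hypotheses.
    A hypothesis is reduced here to its pose (t, sigma), with t the translation
    and sigma the scale change; d_m is the diagonal of the model's bounding-box.
    Low values mean compatible poses."""
    def c(a, b):
        (ta, sa), (tb, sb) = a, b
        trans = math.dist(ta, tb) / (d_m * sa)  # translation mismatch
        scale = max(sa / sb, sb / sa)           # scale mismatch (>= 1)
        return trans * scale
    return 0.5 * (c(hyp_i, hyp_j) + c(hyp_j, hyp_i))

# Identical poses are perfectly compatible:
print(c_sim(((0, 0), 1.0), ((0, 0), 1.0), d_m=10.0))  # 0.0
```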
Integration phase. We now combine the hypotheses within each group G ⊂ A into a
single integrated hypothesis.
1. Let the central hypothesis Pc of G be the one maximizing

$$\phi(P_i) \cdot \sum_{P_j \in G \setminus \{P_i\}} |P_i \cap P_j| \cdot \phi(P_j)$$

where |Pi ∩ Pj| is the number of segment matches present in both Pi and Pj.
The central hypothesis best combines the features of having a good score and being
similar to the others. Hence, it is the best representative of the group. Note how
the selection of Pc is stable w.r.t. fluctuations of the scores, and robust to clutter
hypotheses which occasionally slip into a correct group.
2. Initialize the integrated hypothesis as Gint = Pc , and add the hypothesis B resulting
in the highest combined score φ(Gint ). This means adding the parts of B that match
model segments unexplained by Gint (figure 5d, with initial Gint in 5b). Iteratively
add hypotheses until φ(Gint ) increases no further.
3. Score the integrated hypothesis by taking into account repetitions within the group,
so as to accumulate the evidence for its correctness. φ(Gint ) is updated by multi-
plying the component matches’ scores by the number of times they are repeated.
Evidence accumulation raises the scores of correct integrated hypotheses, thus im-
proving their separation from false-positives.
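The greedy fusion in integration step 2 can be sketched as follows; `combined_score` is an assumed helper that scores the union of a set of hypotheses (a minimal sketch, not the paper's implementation):

```python
def integrate(group, central, combined_score):
    """Sketch of the greedy fusion loop: start from the central hypothesis
    and keep absorbing the group member that most increases the combined
    score, stopping when no addition helps. `combined_score(hyps)` scores
    the union of the given hypotheses."""
    merged = [central]
    remaining = [h for h in group if h is not central]
    score = combined_score(merged)
    while remaining:
        best = max(remaining, key=lambda h: combined_score(merged + [h]))
        new_score = combined_score(merged + [best])
        if new_score <= score:
            break  # phi(G_int) increases no further
        merged.append(best)
        remaining.remove(best)
        score = new_score
    return merged, score
```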
In addition to assembling partial hypotheses into complete contours and accumulating
evidence, the hypothesis integration stage also enables the detection of multiple object
instances in the same test image (delivered as separate integrated hypotheses). More-
over, the computational cost is low (1-2 seconds on average).
The integrated hypotheses Gint are the final output of the system (called detections).
In case of multiple detections on the same image location, we keep only the one with
the highest score.
References
1. R. Basri, L. Costa, D. Geiger, D. Jacobs, Determining the Similarity of Deformable Shapes,
Vision Research, 1998.
2. S. Belongie, J. Malik, J. Puzicha, Shape Matching and Object Recognition Using Shape
Contexts, PAMI, 24:4, 2002.
3. A. Berg, T. Berg and J. Malik, Shape Matching and Object Recognition using Low Distortion
Correspondence, CVPR, 2005.
4. J. R. Beveridge and E. M. Riseman, How Easy is Matching 2D Line Models Using Local Search?, PAMI, 19:6, 1997.
5. A. Del Bimbo, P. Pala, Visual Image Retrieval by Elastic Matching of User Sketches, PAMI,
19:2, 1997.
6. T. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, Active Shape Models - Their Training
and Application, CVIU, 61:1, 1995.
7. D. Cremers, C. Schnorr, and J. Weickert, Diffusion-Snakes: Combining Statistical Shape
Knowledge and Image Information in a Variational Framework, Workshop on Variational
and Levelset Methods, 2001.
8. P. F. Felzenszwalb, Representation and Detection of Deformable Shapes, CVPR, 2003.
9. V. Ferrari, T. Tuytelaars, and L. Van Gool, Real-time Affine Region Tracking and Coplanar
Grouping, CVPR, 2001.
10. D. Gavrila, V. Philomin, Real-time Object Detection for Smart Vehicles, ICCV, 1999.
11. W. Grimson, T. Lozano-Perez, Localizing Overlapping Parts by Searching the Interpretation
Tree, PAMI, 9:4, 1987.
12. D. Jacobs, Robust and Efficient Detection of Convex Groups, PAMI, 18:1, 1996.
13. B. Leibe, B. Schiele, Pedestrian detection in crowded scenes, CVPR, 2005.
14. D. Lowe, T. Binford, Perceptual organization as a basis for visual recognition, AAAI, 1983.
15. D. Martin, C. Fowlkes and J. Malik, Learning to detect natural image boundaries using local
brightness, color, and texture cues, PAMI, 26(5):530-549, 2004.
16. A. Selinger, R. Nelson, A Cubist approach to Object Recognition, ICCV, 1998.
17. A. Thayananthan, B. Stenger, P. Torr, R. Cipolla, Shape Context and Chamfer Matching in
Cluttered Scenes, CVPR, 2003.
[Plots: detection rate vs. false-positives per image, for the five object classes.]