See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/226359900

Contour and Texture Analysis for Image Segmentation

Article in International Journal of Computer Vision · June 2001
DOI: 10.1023/A:1011174803800 · Source: CiteSeer

All content following this page was uploaded by Serge Belongie on 21 May 2014.


International Journal of Computer Vision 43(1), 7–27, 2001

© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

Contour and Texture Analysis for Image Segmentation

JITENDRA MALIK, SERGE BELONGIE, THOMAS LEUNG∗ AND JIANBO SHI†


Computer Science Division, University of California at Berkeley, Berkeley, CA 94720-1776, USA

Received December 28, 1999; Revised February 23, 2001; Accepted February 23, 2001

Abstract. This paper provides an algorithm for partitioning grayscale images into disjoint regions of coherent
brightness and texture. Natural images contain both textured and untextured regions, so the cues of contour and
texture differences are exploited simultaneously. Contours are treated in the intervening contour framework, while
texture is analyzed using textons. Each of these cues has a domain of applicability, so to facilitate cue combination we
introduce a gating operator based on the texturedness of the neighborhood at a pixel. Having obtained a local measure
of how likely two nearby pixels are to belong to the same region, we use the spectral graph theoretic framework of
normalized cuts to find partitions of the image into regions of coherent texture and brightness. Experimental results
on a wide range of images are shown.

Keywords: segmentation, texture, grouping, cue integration, texton, normalized cut

1. Introduction

To humans, an image is not just a random collection of pixels; it is a meaningful arrangement of regions and objects. Figure 1 shows a variety of images. Despite the large variations of these images, humans have no problem interpreting them. We can agree about the different regions in the images and recognize the different objects. Human visual grouping was studied extensively by the Gestalt psychologists in the early part of the 20th century (Wertheimer, 1938). They identified several factors that lead to human perceptual grouping: similarity, proximity, continuity, symmetry, parallelism, closure and familiarity. In computer vision, these factors have been used as guidelines for many grouping algorithms.

The most studied version of grouping in computer vision is image segmentation. Image segmentation techniques can be classified into two broad families—(1) region-based, and (2) contour-based approaches. Region-based approaches try to find partitions of the image pixels into sets corresponding to coherent image properties such as brightness, color and texture. Contour-based approaches usually start with a first stage of edge detection, followed by a linking process that seeks to exploit curvilinear continuity.

These two approaches need not be that different from each other. Boundaries of regions can be defined to be contours. If one enforces closure in a contour-based framework (Elder and Zucker, 1996; Jacobs, 1996) then one can get regions from a contour-based approach. The difference is more one of emphasis and what grouping factor is coded more naturally in a given framework.

A second dimension on which approaches can be compared is local vs. global. Early techniques, in both contour and region frameworks, made local decisions—in the contour framework this might be declaring an edge at a pixel with high gradient, in the region framework this might be making a merge/split decision based on a local, greedy strategy.

Region-based techniques lend themselves more readily to defining a global objective function (for example, Markov random fields (Geman and Geman, 1984) or variational formulations (Mumford and Shah, 1989)). The advantage of having a global objective function is that decisions are made only when

∗ Present address: Compaq Cambridge Research Laboratory.
† Present address: Robotics Institute, Carnegie Mellon University.

Figure 1. Some challenging images for a segmentation algorithm. Our goal is to develop a single grouping procedure which can deal with all
these types of images.

information from the whole image is taken into account at the same time.

In contour-based approaches, often the first step of edge detection is done locally. Subsequently efforts are made to improve results by a global linking process that seeks to exploit curvilinear continuity. Examples include dynamic programming (Montanari, 1971), relaxation approaches (Parent and Zucker, 1989), saliency networks (Sha'ashua and Ullman, 1988), stochastic completion (Williams and Jacobs, 1995). A criticism of this approach is that the edge/no edge decision is made prematurely. To detect extended contours of very low contrast, a very low threshold has to be set for the edge detector. This will cause random edge segments to be found everywhere in the image, making the task of the curvilinear linking process unnecessarily harder than if the raw contrast information was used.

A third dimension on which various segmentation schemes can be compared is the class of images for which they are applicable. As suggested by Fig. 1, we have to deal with images which have both textured and untextured regions. Here boundaries must be found using both contour and texture analysis. However what we find in the literature are approaches which concentrate on one or the other.

Contour analysis (e.g. edge detection) may be adequate for untextured images, but in a textured region it results in a meaningless tangled web of contours. Think for instance of what an edge detector would return on the snow and rock region in Fig. 1(a). The traditional "solution" for this problem in edge detection is to use a high threshold so as to minimize the number of edges found in the texture area. This is obviously a non-solution—such an approach means that low-contrast extended contours will be missed as well. This problem is illustrated in Fig. 2. There is no recognition of the fact that extended contours, even weak in contrast, are perceptually significant.

While the perils of using edge detection in textured regions have been noted before (see e.g. Binford, 1981), a complementary problem of contours constituting a problem for texture analysis does not seem to have been recognized before. Typical approaches are based on measuring texture descriptors over local windows, and then computing differences between window descriptors centered at different locations. Boundaries can then give rise to thin strip-like regions, as in Fig. 3. For specificity, assume that the texture descriptor is a histogram of linear filter outputs computed over a window. Any histogram window near the boundary of the two regions will contain large filter responses from filters oriented along the direction of the edge. However, on both sides of the boundary, the histogram will indicate a featureless region. A segmentation algorithm based on, say, χ²

Figure 2. Demonstration of texture as a problem for the contour process. Each image shows the edges found with a Canny edge detector for the
penguin image using different scales and thresholds: (a) fine scale, low threshold, (b) fine scale, high threshold, (c) coarse scale, low threshold,
(d) coarse scale, high threshold. A parameter setting that preserves the correct edges while suppressing spurious detections in the textured area
is not possible.

Figure 3. Demonstration of the “contour-as-a-texture” problem using a real image. (a) Original image of a bald eagle. (b) The groups found
by an EM-based algorithm (Belongie et al., 1998).

distances between histograms, will inevitably partition the boundary as a group of its own. As is evident, the problem is not confined to the use of a histogram of filter outputs as texture descriptor. Figure 3(b) shows the actual groups found by an EM-based algorithm using an alternative color/texture descriptor (Belongie et al., 1998).

1.1. Desiderata of a Theory of Image Segmentation

At this stage, we are ready to summarize our desired attributes for a theory of image segmentation.

1. It should deal with general images. Regions with or without texture should be processed in the same framework, so that the cues of contour and texture differences can be simultaneously exploited.
2. In terms of contour, the approach should be able to deal with boundaries defined by brightness step edges as well as lines (as in a cartoon sketch).
3. Image regions could contain texture which could be regular such as the polka dots in Fig. 1(c), stochastic as in the snow and rock region in (a) or anywhere in between such as the tiger stripes in (b). A key question here is that one needs an automatic procedure for scale selection. Whatever one's choice of texture descriptor, it has to be computed over a local window whose size and shape need to be determined adaptively. What makes scale selection a challenge is that the technique must deal with the wide range of textures—regular, stochastic, or intermediate cases—in a seamless way.

1.2. Introducing Textons

Julesz introduced the term texton, analogous to a phoneme in speech recognition, nearly 20 years ago (Julesz, 1981) as the putative units of preattentive human texture perception. He described them qualitatively for simple binary line segment stimuli—oriented segments, crossings and terminators—but did not provide an operational definition for gray-level images. Subsequently, texton theory fell into disfavor as a model of human texture discrimination as accounts based on spatial filtering with orientation and scale-selective mechanisms that could be applied to arbitrary gray-level images became popular.

There is a fundamental, well recognized, problem with linear filters. Generically, they respond to any stimulus. Just because you have a response to an oriented odd-symmetric filter doesn't mean there is an edge at that location. It could be that there is a higher contrast bar at some other location in a different orientation which has caused this response. Tokens such as edges or bars or corners can not be associated with the output of a single filter. Rather it is the signature of the outputs over scales, orientations and order of the filter that is more revealing.

Here we introduce a further step by focussing on the outputs of these filters considered as points in a high dimensional space (on the order of 40 filters are used). We perform vector quantization, or clustering, in this high-dimensional space to find prototypes. Call these prototypes textons—we will find empirically that these tend to correspond to oriented bars, terminators and so on. One can construct a universal texton vocabulary by processing a large number of natural images, or we could find them adaptively in windows of images. In each case the K-means technique can be used. By mapping each pixel to the texton nearest to its vector of filter responses, the image can be analyzed into texton channels, each of which is a point set.

It is our opinion that the analysis of an image into textons will prove useful for a wide variety of visual processing tasks. For instance, in Leung and Malik (1999) we use the related notion of 3D textons for recognition of textured materials. In the present paper, our objective is to develop an algorithm for the segmentation of an image into regions of coherent brightness and texture—we will find that the texton representation will enable us to address the key problems in a very natural fashion.

1.3. Summary of Our Approach

We pursue image segmentation in the framework of Normalized Cuts introduced by Shi and Malik (1997, 2000). The image is considered to be a weighted graph where the nodes i and j are pixels and edge weights, Wij, denote a local measure of similarity between the two pixels. Grouping is performed by finding eigenvectors of the Normalized Laplacian of this graph (§3). The fundamental issue then is that of specifying the edge weights Wij; we rely on normalized cuts to go from these local measures to a globally optimal partition of the image.

The algorithm analyzes the image using the two cues of contour and texture. The local similarity measure between pixels i and j due to the contour cue, W^IC_ij, is computed in the intervening contour framework of Leung and Malik (1998) using peaks in contour orientation energy (§2 and §4.1). Texture is analysed using textons (§2.1). Appropriate local scale is estimated from the texton labels. A histogram of texton densities is used as the texture descriptor. Similarity, W^TX_ij, is measured using the χ² test on the histograms (§4.2). The edge weights Wij combining both contour and texture information are specified by gating each of the two cues with a texturedness measure (§4.3).

In (§5), we present the practical details of going from the eigenvectors of the normalized Laplacian matrix of the graph to a partition of the image. Results from the algorithm are presented in (§6). Some of the results presented here were published in Malik et al. (1999).

2. Filters, Composite Edgels, and Textons

Since the 1980s, many approaches have been proposed in the computer vision literature that start by convolving the image with a bank of linear spatial filters fi tuned to various orientations and spatial frequencies (Knutsson and Granlund, 1983; Koenderink and van Doorn, 1987; Fogel and Sagi, 1989; Malik and Perona, 1990). (See Fig. 4 for an example of such a filter set.)

These approaches were inspired by models of processing in the early stages of the primate visual system (e.g. DeValois and DeValois, 1988). The filter kernels fi are models of receptive fields of simple cells in visual

Figure 4. Left: Filter set fi consisting of 2 phases (even and odd), 3 scales (spaced by half-octaves), and 6 orientations (equally spaced from 0 to π). The basic filter is a difference-of-Gaussian quadrature pair with 3:1 elongation. Right: 4 scales of center-surround filters. Each filter is L1-normalized for scale invariance.

cortex. To a first approximation, we can classify them into three categories:

1. Cells with radially symmetric receptive fields. The usual choice of fi is a Difference of Gaussians (DOG) with the two Gaussians having different values of σ. Alternatively, these receptive fields can also be modeled as the Laplacian of Gaussian.
2. Oriented odd-symmetric cells whose receptive fields can be modeled as rotated copies of a horizontal odd-symmetric receptive field. A suitable point spread function for such a receptive field is f(x, y) = G′σ1(y)Gσ2(x), where Gσ(x) represents a Gaussian with standard deviation σ. The ratio σ2 : σ1 is a measure of the elongation of the filter.
3. Oriented even-symmetric cells whose receptive fields can be modeled as rotated copies of a horizontal even-symmetric receptive field. A suitable point spread function for such a receptive field is

   f(x, y) = G″σ1(y)Gσ2(x)

The use of Gaussian derivatives (or equivalently, differences of offset Gaussians) for modeling receptive fields of simple cells is due to Young (1985). One could equivalently use Gabor functions. Our preference for Gaussian derivatives is based on their computational simplicity and their natural interpretation as 'blurred derivatives' (Koenderink and van Doorn, 1987, 1988).

The oriented filterbank used in this work, depicted in Fig. 4, is based on rotated copies of a Gaussian derivative and its Hilbert transform. More precisely, let f1(x, y) = G″σ1(y)Gσ2(x) and f2(x, y) equal the Hilbert transform of f1(x, y) along the y axis:

f1(x, y) = (d²/dy²) [ (1/C) exp(−y²/σ²) exp(−x²/(ℓ²σ²)) ]
f2(x, y) = Hilbert(f1(x, y))

where σ is the scale, ℓ is the aspect ratio of the filter, and C is a normalization constant. (The use of the Hilbert transform instead of a first derivative makes f1 and f2 an exact quadrature pair.) The radially symmetric portion of the filterbank consists of Difference-of-Gaussian kernels. Each filter is zero-mean and L1-normalized for scale invariance (Malik and Perona, 1990).

Now suppose that the image is convolved with such a bank of linear filters. We will refer to the collection of response images I ∗ fi as the hypercolumn transform of the image.

Why is this useful from a computational point of view? The vector of filter outputs I ∗ fi(x0, y0) characterizes the image patch centered at (x0, y0) by a set of values at a point. This is similar to characterizing an analytic function by its derivatives at a point—one can use a Taylor series approximation to find the values of the function at neighboring points. As pointed out by Koenderink and van Doorn (1987), this is more than an analogy: because of the commutativity of the operations of differentiation and convolution, the receptive fields described above are in fact computing 'blurred derivatives'. We recommend Koenderink and van Doorn (1987, 1988), Jones and Malik (1992), and Malik and Perona (1992) for a discussion of other advantages of such a representation.

The hypercolumn transform provides a convenient front end for contour and texture analysis:
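The even/odd quadrature pair defined above can be synthesized numerically. The following is an illustrative sketch only (the function name is ours; discrete finite differences and `scipy.signal.hilbert` stand in for the analytic second derivative and Hilbert transform):

```python
import numpy as np
from scipy.signal import hilbert

def quadrature_pair(sigma=2.0, ell=3.0, size=None):
    """Even filter f1 = second y-derivative of an elongated Gaussian,
    and its odd quadrature partner f2 = Hilbert transform along y."""
    if size is None:
        size = 2 * int(np.ceil(3 * ell * sigma)) + 1
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)                 # y varies along axis 0
    g = np.exp(-y**2 / sigma**2) * np.exp(-x**2 / (ell * sigma)**2)
    f1 = np.gradient(np.gradient(g, axis=0), axis=0)  # d^2/dy^2
    f1 -= f1.mean()                          # zero-mean
    f1 /= np.abs(f1).sum()                   # L1-normalize (scale invariance)
    f2 = np.imag(hilbert(f1, axis=0))        # Hilbert transform along y
    f2 /= np.abs(f2).sum()
    return f1, f2
```

Rotated copies of this pair (e.g. via `scipy.ndimage.rotate`) would give the 6 orientations, and σ values spaced by half-octaves the 3 scales of Fig. 4.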

– Contour. In computational vision, it is customary to model brightness edges as step edges and to detect them by marking locations corresponding to the maxima of the outputs of odd-symmetric filters (e.g. Canny, 1986) at appropriate scales. However, it should be noted that step edges are an inadequate model for the discontinuities in the image that result from the projection of depth or orientation discontinuities in the physical scene. Mutual illumination and specularities are quite common and their effects are particularly significant in the neighborhood of convex or concave object edges. In addition, there will typically be a shading gradient on the image regions bordering the edge. As a consequence of these effects, real image edges are not step functions but more typically a combination of steps, peak and roof profiles. As was pointed out in Perona and Malik (1990), the oriented energy approach (Knutsson and Granlund, 1983; Morrone and Owens, 1987; Morrone and Burr, 1988) can be used to detect and localize correctly these composite edges.

  The oriented energy, also known as the "quadrature energy," at angle 0° is defined as:

  OE0° = (I ∗ f1)² + (I ∗ f2)²

  OE0° has maximum response for horizontal contours. Rotated copies of the two filter kernels are able to pick up composite edge contrast at various orientations.

  Given OEθ, we can proceed to localize the composite edge elements (edgels) using oriented nonmaximal suppression. This is done for each scale in the following way. At a generic pixel q, let θ* = arg max OEθ denote the dominant orientation and OE* the corresponding energy. Now look at the two neighboring values of OE* on either side of q along the line through q perpendicular to the dominant orientation. The value OE* is kept at the location of q only if it is greater than or equal to each of the neighboring values. Otherwise it is replaced with a value of zero.

  Noting that OE* ranges between 0 and infinity, we convert it to a probability-like number between 0 and 1 as follows:

  pcon = 1 − exp(−OE*/σIC)    (1)

  σIC is related to the oriented energy response purely due to image noise. We use σIC = 0.02 in this paper. The idea is that for any contour with OE* ≫ σIC, pcon ≈ 1.
– Texture. As the hypercolumn transform provides a good local descriptor of image patches, the boundary between differently textured regions may be found by detecting curves across which there is a significant gradient in one or more of the components of the hypercolumn transform. For an elaboration of this approach, see Malik and Perona (1990).

  Malik and Perona relied on averaging with large kernels to smooth away spatial variation for filter responses within regions of texture. This process loses a lot of information about the distribution of filter responses; a much better method is to represent the neighborhood around a pixel by a histogram of filter outputs (Heeger and Bergen, 1995; Puzicha et al., 1997). While this has been shown to be a powerful technique, it leaves open two important questions. Firstly, there is the matter of what size window to use for pooling the histogram—the integration scale. Secondly, these approaches only make use of marginal binning, thereby missing out on the informative characteristics that joint assemblies of filter outputs exhibit at points of interest. We address each of these questions in the following section.

2.1. Textons

Though the representation of textures using filter responses is extremely versatile, one might say that it is overly redundant (each pixel value is represented by Nfil real-valued filter responses, where Nfil is 40 for our particular filter set). Moreover, it should be noted that we are characterizing textures, entities with some spatially repeating properties by definition. Therefore, we do not expect the filter responses to be totally different at each pixel over the texture. Thus, there should be several distinct filter response vectors and all others are noisy variations of them.

This observation leads to our proposal of clustering the filter responses into a small set of prototype response vectors. We call these prototypes textons. Algorithmically, each texture is analyzed using the filter bank shown in Fig. 4. Each pixel is now transformed to a Nfil-dimensional vector of filter responses. These vectors are clustered using K-means. The criterion for this algorithm is to find K "centers" such that after assigning each data vector to the nearest center, the sum of the squared distance from the centers is minimized. K-means is a greedy algorithm that finds a local minimum of this criterion.1

Figure 5. (a) Polka-dot image. (b) Textons found via K-means with K = 25, sorted in decreasing order by norm. (c) Mapping of pixels to the texton channels. The dominant structures captured by the textons are translated versions of the dark spots. We also see textons corresponding to faint oriented edge and bar elements. Notice that some channels contain activity inside a textured region or along an oriented contour and nowhere else.

It is useful to visualize the resulting cluster centers in terms of the original filter kernels. To do this, recall that each cluster center represents a set of projections of each filter onto a particular image patch. We can solve for the image patch corresponding to each cluster center in a least squares sense by premultiplying the vectors representing the cluster centers by the pseudoinverse of the filterbank (Jones and Malik, 1992). The matrix representing the filterbank is formed by concatenating the filter kernels into columns and placing these columns side by side. The set of synthesized image patches for two test images are shown in Figs. 5(b) and 6(b). These are our textons. The textons represent assemblies of filter outputs that are characteristic of the local image structure present in the image.

Looking at the polka-dot example, we find that many of the textons correspond to translated versions of dark spots.2 Also included are a number of oriented edge elements of low contrast and two textons representing nearly uniform brightness. The pixel-to-texton mapping is shown in Fig. 5(c). Each subimage shows the pixels in the image that are mapped to the corresponding texton in Fig. 5(b). We refer to this collection of discrete point sets as the texton channels. Since each pixel is mapped to exactly one texton, the texton channels constitute a partition of the image.

Figure 6. (a) Penguin image. (b) Textons found via K-means with K = 25, sorted in decreasing order by norm. (c) Mapping of pixels to the texton channels. Among the textons we see edge elements of varying orientation and contrast along with elements of the stochastic texture in the rocks.

Textons and texton channels are also shown for the penguin image in Fig. 6. Notice in the two examples how much the texton set can change from one image to the next. The spatial characteristics of both the deterministic polka dot texture and the stochastic rocks texture are captured across several texton channels. In general, the texture boundaries emerge as point density changes across the different texton channels. In some cases, a texton channel contains activity inside a particular textured region and nowhere else. By comparison, vectors of filter outputs generically respond with some value at every pixel—a considerably less clean alternative.
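The texton computation just described—filter responses clustered with K-means, then each pixel mapped to its nearest prototype—can be sketched as follows. This is a minimal K-means of our own (with a simplistic deterministic initialization); the authors' implementation, initialization, and choice of K are not reproduced here:

```python
import numpy as np

def compute_textons(responses, K, n_iter=20):
    """responses: (H, W, Nfil) stack of filter responses per pixel.
    Returns (K, Nfil) texton prototypes and an (H, W) label map;
    the pixels labeled k form texton channel k."""
    H, W, Nfil = responses.shape
    X = responses.reshape(-1, Nfil).astype(float)
    # Evenly spaced samples as initial centers (simplistic, deterministic).
    centers = X[np.linspace(0, len(X) - 1, K).astype(int)].copy()
    for _ in range(n_iter):
        # Assign each response vector to the nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centers; empty clusters keep their previous center.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers, labels.reshape(H, W)
```

The texton channel k is then the point set {(x, y) : labels[y, x] = k}, ready for the point-set analysis of the next subsection.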

We have not been particularly sophisticated in the choice of K, the number of different textons for a given image. How to choose an optimal value of K in K-means has been the subject of much research in the model selection and clustering literature; we used a fixed choice K = 36 to obtain the segmentation results in this paper. Clearly, if the images vary considerably in complexity and number of objects in them, an adaptive choice may give better results.

The mapping from pixel to texton channel provides us with a number of discrete point sets where before we had continuous-valued filter vectors. Such a representation is well suited to the application of techniques from computational geometry and point process statistics. With these tools, one can approach questions such as, "what is the neighborhood of a texture element?" and "how similar are two pixels inside a textured region?"

Several previous researchers have employed clustering using K-means or vector quantization as a stage in their approach to texture classification—two representative examples are McLean (1993) and Raghu et al. (1997). What is novel about our approach is the identification of clusters of vectors of filter outputs with the Julesz notion of textons. Then first order statistics of textons are used for texture characterization, and the spatial structure within texton channels enables scale estimation. Vector quantization becomes much more than just a data compression or coding step. The next subsection should make this point clear.

2.1.1. Local Scale and Neighborhood Selection. The texton channel representation provides us a natural way to define texture scale. If the texture is composed of discrete elements ("texels"), we might want to define a notion of texel neighbors and consider the mean distance between them to be a measure of scale. Of course, many textures are stochastic and detecting texels reliably is hard even for regular textures.

With textons we have a "soft" way to define neighbors. For a given pixel in a texton channel, first consider it as a "thickened point"—a disk centered at it.3 The idea is that while textons are being associated with pixels, since they correspond to assemblies of filter outputs, it is better to think of them as corresponding to a small image disk defined by the scale used in the Gaussian derivative filters. Recall Koenderink's aphorism about a point in image analysis being a Gaussian blob of small σ!

Now consider the Delaunay neighbors of all the pixels in the thickened point of a pixel i which lie closer than some outer scale.4 The intuition is that these will be pixels in spatially neighboring texels. Compute the distances of all these pixels to i; the median of these constitutes a robust local measure of inter-texel distance. We define the local scale α(i) to be 1.5 times this median distance.

In Fig. 7(a), the Delaunay triangulation of a zoomed-in portion of one of the texton channels in the polka-dot dress of Fig. 5(a) is shown atop a brightened version of the image. Here the nodes represent points that are similar in the image while the edges provide proximity information.

The local scale α(i) is based just on the texton channel for the texton at i. Since neighboring pixels should have similar scale and could be drawn from other texton channels, we can improve the estimate of scale by median filtering of the scale image.

2.1.2. Computing Windowed Texton Histograms. Pairwise texture similarities will be computed by comparing windowed texton histograms.

Figure 7. Illustration of scale selection. (a) Closeup of Delaunay triangulation of pixels in a particular texton channel for polka dot image. (b)
Neighbors of thickened point for pixel at center. The thickened point lies within inner circle. Neighbors are restricted to lie within outer circle.
(c) Selected scale based on median of neighbor edge lengths, shown by circle, with all pixels falling inside circle marked with dots.
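The scale-selection idea of §2.1.1 and Fig. 7 can be sketched with SciPy's Delaunay triangulation. This simplified version takes 1.5 times the median length of the Delaunay edges incident to a point; the paper's refinements (the thickened point, the outer-scale cutoff, and median filtering of the scale image) are omitted:

```python
import numpy as np
from scipy.spatial import Delaunay

def local_scale(points, i, factor=1.5):
    """points: (N, 2) positions of one texton channel; i: query index.
    Returns alpha(i) = factor * median Delaunay-neighbor distance."""
    tri = Delaunay(points)
    indptr, indices = tri.vertex_neighbor_vertices
    nbrs = indices[indptr[i]:indptr[i + 1]]   # Delaunay neighbors of point i
    dists = np.linalg.norm(points[nbrs] - points[i], axis=1)
    return factor * np.median(dists)
```

On a roughly regular point set with inter-texel spacing d, the Delaunay neighbors sit at distances between d and about d√2, so the estimate lands a bit above 1.5d, as intended.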

window W(i) for a generic pixel i as the axis-aligned For more discussion of this criterion, please refer to Shi
square of radius α(i) centered on pixel i. and Malik (2000).
Each histogram has K bins, one for each texton chan- One key advantage of using the normalized cut is that
nel. The value of the kth histogram bin for a pixel i is a good approximation to the optimal partition can be
found by counting how many pixels in texton channel k computed very efficiently.5 Let W be the association
fall inside the window W(i). Thus the histogram represents texton frequencies in a local neighborhood. We can write this as

    h_i(k) = Σ_{j∈W(i)} I[T(j) = k]    (2)

where I[·] is the indicator function and T(j) returns the texton assigned to pixel j.

3. The Normalized Cut Framework

In the Normalized Cut framework (Shi and Malik, 1997, 2000), which is inspired by spectral graph theory (Chung, 1997), Shi and Malik formulate visual grouping as a graph partitioning problem. The nodes of the graph are the entities that we want to partition; for example, in image segmentation, they are the pixels. The edges between two nodes correspond to the strength with which these two nodes belong to one group; again, in image segmentation, the edges of the graph correspond to how much two pixels agree in brightness, color, etc. Intuitively, the criterion for partitioning the graph is to minimize the sum of weights of connections across the groups and maximize the sum of weights of connections within the groups.

Let G = {V, E} be a weighted undirected graph, where V are the nodes and E are the edges. Let A, B be a partition of the graph: A ∪ B = V, A ∩ B = ∅. In graph theoretic language, the similarity between these two groups is called the cut:

    cut(A, B) = Σ_{i∈A, j∈B} W_ij

where W_ij is the weight on the edge between nodes i and j. Shi and Malik proposed to use a normalized similarity criterion to evaluate a partition. They call it the normalized cut:

    Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(B, A)/assoc(B, V)

where assoc(A, V) = Σ_{i∈A, k∈V} W_ik is the total connection from nodes in A to all the nodes in the graph. Let W be the weight matrix, i.e. W_ij is the weight between nodes i and j in the graph, and let D be the diagonal matrix such that D_ii = Σ_j W_ij, i.e. D_ii is the sum of the weights of all the connections to node i. Shi and Malik showed that the optimal partition can be found by computing:

    y = arg min Ncut = arg min_y [y^T (D − W) y] / [y^T D y]    (3)

where y = {a, b}^N is a binary indicator vector specifying the group identity for each pixel, i.e. y_i = a if pixel i belongs to group A and y_j = b if pixel j belongs to B; N is the number of pixels. Notice that the above expression is a Rayleigh quotient. If we relax y to take on real values (instead of two discrete values), we can optimize Eq. (3) by solving a generalized eigenvalue system. Efficient algorithms with polynomial running time are well known for such problems.

The process of transforming the vector y into a discrete bipartition, and the generalization to more than two groups, is discussed in (§5).

4. Defining the Weights

The quality of a segmentation based on Normalized Cuts, or of any other algorithm based on pairwise similarities, fundamentally depends on the weights—the W_ij's—that are provided as input. The weights should be large for pixels that should belong together and small otherwise. We now discuss our method for computing the W_ij's. Since we seek to combine evidence from two cues, we will first discuss the computation of the weights for each cue in isolation, and then describe how the two weights can be combined in a meaningful fashion.

4.1. Images Without Texture

Consider for the moment the "cracked earth" image in Fig. 1(e). Such an image contains no texture and may be treated in a framework based solely on contour features. The definition of the weights in this case, which we denote W_ij^IC, is adopted from the intervening contour method introduced in Leung and Malik (1998).
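The relaxed optimization of Eq. (3) can be illustrated with a small numerical sketch. The following is not the paper's implementation: it uses a dense toy affinity matrix and SciPy's dense generalized eigensolver, whereas the actual pixel graphs are large and sparse and call for iterative solvers. The function name `ncut_bipartition` and the median threshold are illustrative choices.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W):
    """Relaxed Normalized Cut of Eq. (3): solve (D - W) y = lambda D y
    and threshold the second smallest generalized eigenvector."""
    d = W.sum(axis=1)
    D = np.diag(d)
    # Dense generalized eigensolver; eigenvalues come back in
    # ascending order, so column 1 holds the second smallest.
    _, vecs = eigh(D - W, D)
    y = vecs[:, 1]
    return y >= np.median(y)

# Two loosely coupled cliques of three nodes each; the relaxed cut
# should separate them.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
labels = ncut_bipartition(W)
```

Thresholding the second eigenvector at its median is only one possible discretization; (§5.4) of the paper searches over several thresholds and keeps the one minimizing the Ncut value.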
Contour and Texture Analysis 17
Figure 8. Left: the original image. Middle: part of the image marked by the box. The intensity values at pixels p1 , p2 and p3 are similar.
However, there is a contour in the middle, which suggests that p1 and p2 belong to one group while p3 belongs to another. Just comparing
intensity values at these three locations will mistakenly suggest that they belong to the same group. Right: orientation energy. Somewhere along
l2 , the orientation energy is strong which correctly proposes that p1 and p3 belong to two different partitions, while orientation energy along l1
is weak throughout, which will support the hypothesis that p1 and p2 belong to the same group.
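As a concrete sketch of the intervening contour idea illustrated above, suppose a map p_con of contour probabilities is given on the pixel grid. The code below samples p_con densely along the segment joining two pixels instead of only at the orientation energy maxima M_ij, which is a simplification; the helper name `w_ic` is ours, not the paper's.

```python
import numpy as np

def w_ic(p_con, pi, pj, n_samples=32):
    """Sketch of the intervening-contour weight:
    W_ij^IC = 1 - max of p_con along the line joining pixels i and j.
    p_con is a 2-D array of contour probabilities; the segment is
    sampled densely rather than only at orientation-energy maxima."""
    rr = np.linspace(pi[0], pj[0], n_samples).round().astype(int)
    cc = np.linspace(pi[1], pj[1], n_samples).round().astype(int)
    return 1.0 - p_con[rr, cc].max()

# A vertical "contour" of high p_con at column 5.
p_con = np.zeros((20, 20))
p_con[:, 5] = 0.9
same_side = w_ic(p_con, (10, 0), (10, 4))   # no contour crossed
across = w_ic(p_con, (10, 0), (10, 10))     # crosses the contour
```

As intended, the link is strong (weight 1.0) when no contour intervenes and weak (weight 0.1 here) when the line crosses a strong orientation energy maximum.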
Figure 8 illustrates the intuition behind this idea. On the left is an image. The middle figure shows a magnified part of the original image. On the right is the orientation energy. There is an extended contour separating p3 from p1 and p2. Thus, we expect p1 to be much more strongly related to p2 than to p3. This intuition carries over into our definition of dissimilarity between two pixels: if the orientation energy along the line between two pixels is strong, the dissimilarity between these pixels should be high (and W_ij should be low).

Contour information in an image is computed "softly" through orientation energy (OE) from elongated quadrature filter pairs. We introduce a slight modification here to allow for exact sub-pixel localization of the contour by finding the local maxima in the orientation energy perpendicular to the contour orientation (Perona and Malik, 1990). The orientation energy gives the confidence of this contour. W_ij^IC is then defined as follows:

    W_ij^IC = 1 − max_{x∈M_ij} p_con(x)

where M_ij is the set of local maxima along the line joining pixels i and j. Recall from (§2) that p_con(x), 0 < p_con < 1, is nearly 1 whenever the oriented energy maximum at x is sufficiently above the noise level. In words, two pixels will have a weak link between them if there is a strong local maximum of orientation energy along the line joining them. On the contrary, if there is little energy, for example in a constant brightness region, the link between the two pixels will be strong. Contours measured at different scales can be taken into account by computing the orientation energy maxima at various scales and setting p_con to be the maximum over all scales at each pixel.

4.2. Images that are Texture Mosaics

Now consider the case of images wherein all of the boundaries arise from neighboring patches of different texture (e.g. Fig. 1(d)). We compute pairwise texture similarities by comparing windowed texton histograms computed using the technique described previously (§2.1.2). A number of methods are available for comparing histograms. We use the χ² test, defined as

    χ²(h_i, h_j) = (1/2) Σ_{k=1}^{K} [h_i(k) − h_j(k)]² / [h_i(k) + h_j(k)]

where h_i and h_j are the two histograms. For an empirical comparison of the χ² test versus other texture similarity measures, see Puzicha et al. (1997). W_ij^TX is then defined as follows:

    W_ij^TX = exp(−χ²(h_i, h_j)/σ_TX)    (4)

If the histograms h_i and h_j are very different, χ² is large, and the weight W_ij^TX is small.

4.3. General Images

Finally we consider the general case of images that contain boundaries of both kinds. This presents us with the problem of cue integration. The obvious approach to cue integration is to define the weight between pixels i and j as the product of the contributions from each cue: W_ij = W_ij^IC × W_ij^TX. The idea is that if either of the cues suggests that i and j should be separated, the composite weight, W_ij, should be small. We must be careful, however, to avoid the problems listed in the Introduction (§1) by suitably gating the cues. The spirit of the gating method is to make each cue "harmless" in locations where the other cue should be operating.
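Both the texture weight of Eq. (4) and the texturedness sigmoid introduced next (Eq. (5) in §4.3.1) are built on the χ² statistic. Below is a minimal sketch, using the parameter values quoted in (§4.3.5) as defaults; the helper names are ours, not the paper's.

```python
import numpy as np

def chi2(h1, h2):
    """Chi-squared distance between two histograms; bins that are
    empty in both histograms contribute zero."""
    den = h1 + h2
    num = (h1 - h2) ** 2
    return 0.5 * float(np.sum(np.where(den > 0,
                                       num / np.maximum(den, 1e-12),
                                       0.0)))

def w_tx(h1, h2, sigma_tx=0.025):
    """Texture weight of Eq. (4), with the paper's sigma_TX default."""
    return np.exp(-chi2(h1, h2) / sigma_tx)

def p_texture(chi2_lr, tau=0.3, beta=0.04):
    """Sigmoid of Eq. (5): maps the left/right chi-squared statistic
    to a texturedness value in (0, 1); small when the two sides of a
    pixel have very different texton statistics."""
    return 1.0 - 1.0 / (1.0 + np.exp(-(chi2_lr - tau) / beta))

h_a = np.array([10., 5., 1., 0.])
h_b = np.array([0., 1., 5., 10.])
w_same = w_tx(h_a, h_a)   # identical histograms: weight 1
w_diff = w_tx(h_a, h_b)   # very different histograms: weight near 0
```

Note the directions: matching histograms give a large pairwise weight, while a small χ² between the two half-windows of a pixel (similar statistics on both sides) gives a texturedness value near 1.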
4.3.1. Estimating Texturedness. As illustrated in Fig. 2, the fact that a pixel survives the non-maximum suppression step does not necessarily mean that that pixel lies on a region boundary. Consider a pixel inside a patch of uniform texture: its oriented energy is large, but it does not lie on the boundary of a region. Conversely, consider a pixel lying between two uniform patches of just slightly different brightness: it does lie on a region boundary, but its oriented energy is small. In order to estimate the "probability" that a pixel lies on a boundary, it is necessary to take more surrounding information into account. Clearly the true value of this probability is only determined after the final correct segmentation, which is what we seek to find. At this stage our goal is to formulate a local estimate of the texturedness of the region surrounding a pixel. Since this is a local estimate, it will be noisy, but its objective is to bootstrap the global segmentation procedure.

Our method of computing this value is based on a simple comparison of texton distributions on either side of a pixel relative to its dominant orientation. Consider a generic pixel q at an oriented energy maximum, and let the dominant orientation be θ. Consider a circle of radius α(q) (the selected scale) centered on q. We first divide this circle in two along the diameter with orientation θ. Note that the contour passing through q is tangent to the diameter, which is its best straight-line approximation. The pixels in the disk can be partitioned into three sets D_0, D_−, D_+, which are the pixels in the strip along the diameter, the pixels to the left of D_0, and the pixels to the right of D_0, respectively. To compute our measure of texturedness, we consider two half-window comparisons, with D_0 assigned to each side in turn. Assume without loss of generality that D_0 is first assigned to the "left" half. Denote the K-bin histograms of D_0 ∪ D_− by h_L and of D_+ by h_R, respectively. Now consider the χ² statistic between the two histograms:

    χ²(h_L, h_R) = (1/2) Σ_{k=1}^{K} [h_L(k) − h_R(k)]² / [h_L(k) + h_R(k)]

We repeat the test with the histograms of D_− and D_0 ∪ D_+ and retain the maximum of the two resulting values, which we denote χ²_LR. We can convert this to a probability-like value using a sigmoid as follows:

    p_texture = 1 − 1 / (1 + exp(−(χ²_LR − τ)/β))    (5)

This value, which ranges between 0 and 1, is small if the distributions on the two sides are very different and large otherwise. Note that in the case of untextured regions, such as a brightness step edge, the textons lying along and parallel to the boundary make the statistics of the two sides different. This is illustrated in Fig. 9. Roughly, p_texture ≈ 1 for oriented energy maxima in texture and p_texture ≈ 0 for contours. p_texture is defined to be 0 at pixels which are not oriented energy maxima.

Figure 9. Illustration of the half windows used for the estimation of texturedness. The texturedness of a pixel is based on a χ² test on the textons in the two sides of a box, as shown above for two sample pixels. The size and orientation of the box is determined by the selected scale and dominant orientation for the pixel at the center. Within the rocky area, the texton statistics are very similar, leading to a low χ² value. On the edge of the wing, the χ² value is relatively high due to the dissimilarity of the textons that fire on either side of a step edge. Since, in the case of a contour, the contour itself can lie along the diameter of the circle, we consider two half-window partitions: one where the thin strip around the diameter is assigned to the left side, and one where it is assigned to the other. We consider both possibilities and retain the maximum of the two resulting χ² values.

4.3.2. Gating the Contour Cue. The contour cue is gated by suppressing contour energy according to the value of p_texture. The gated value, p_B, is defined as

    p_B = (1 − p_texture) p_con    (6)

In principle, this value can be computed and dealt with independently at each filter scale. For our purposes, we found it sufficient simply to keep the maximum value of p_B with respect to σ. The gated contour energy is illustrated in Fig. 10, right. The corresponding weight is then given by

    W_ij^IC = 1 − max_{x∈M_ij} p_B(x)

Figure 10. Gating the contour cue. Left: original image. Top: oriented energy after non-maximal suppression, OE*. Bottom: 1 − p_texture. Right: p_B, the product of 1 − p_texture and p_con = 1 − exp(−OE*/σ_IC). Note that this can be thought of as a "soft" edge detector which has been modified to no longer fire on texture regions.

4.3.3. Gating the Texture Cue. The texture cue is gated by computing a texton histogram at each pixel which takes into account the texturedness measure p_texture (see Fig. 11).

Figure 11. Gating the texture cue. Left: original image. Top: texton labels, shown in pseudocolor. Middle: local scale estimate α(i). Bottom: 1 − p_texture. Darker grayscale indicates larger values. Right: local texton histograms at scale α(i) are gated using p_texture as explained in (§4.3.3).

Let h_i be the K-bin texton histogram computed using Eq. (2). We define a (K + 1)-bin histogram ĥ_i by introducing a 0th bin. The intuition is that the 0th bin will keep a count of the number of pixels which do not correspond to texture. These pixels arise in two forms: (1) pixels which are not oriented energy maxima; (2) pixels which are oriented energy maxima but correspond to boundaries between two regions, and thus should not take part in texture processing, to avoid the problems discussed in (§1). More precisely, ĥ_i is defined as follows:

    ĥ_i(k) = Σ_{j∈N(i)} p_texture(j) · I[T(j) = k]    ∀k = 1 . . . K

    ĥ_i(0) = N_B + Σ_{j∈N(i)} (1 − p_texture(j))
where N(i) denotes all the oriented energy maxima lying inside the window W(i), and N_B is the number of pixels which are not oriented energy maxima.

4.3.4. Combining the Weights. After each cue has been gated by the above procedure, we are free to perform simple multiplication of the weights. More specifically, we first obtain W^IC using Eq. (6). Then we obtain W^TX using Eq. (4) with the gated versions of the histograms. Then we simply define the combined weight as

    W_ij = W_ij^IC × W_ij^TX

4.3.5. Implementation Details. The weight matrix is defined between any pair of pixels i and j. Naively, one might connect every pair of pixels in the image. However, this is not necessary. Pixels very far away from each other have a very small likelihood of belonging to the same region. Moreover, dense connectivity means that we would need to solve for the eigenvectors of a matrix of size N_pix × N_pix, where N_pix is close to a million for a typical image. In practice, a sparse and short-ranged connection pattern does a very good job. In our experiments, all the images are of size 128 × 192. Each pixel is connected to pixels within a radius of 30. Furthermore, a sparse sampling is implemented such that the number of connections is approximately constant at each radius. The number of non-zero connections per pixel is 1000 in our experiments. For images of different sizes, the connection radius can be scaled appropriately.

The parameters for the various formulae are given here:

1. The image brightness lies in the range [0, 1].
2. σ_IC = 0.02 (Eq. (1)).
3. The number of textons computed using K-means: K = 36.
4. The textons are computed following a contrast normalization step, motivated by Weber's law. Let ‖F(x)‖ be the L2 norm of the filter responses at pixel x. We normalize the filter responses by the following equation:

       F(x) ← F(x) × log(1 + ‖F(x)‖/0.03) / ‖F(x)‖

5. σ_TX = 0.025 (Eq. (4)).
6. τ = 0.3 and β = 0.04 (Eq. (5)).

Note that these parameters are the same for all the results shown in (§6).

5. Computing the Segmentation

With a properly defined weight matrix, the normalized cut formulation discussed in (§3) can be used to compute the segmentation. However, the weight matrix defined in the previous section is computed using only local information, and is thus not perfect. The ideal weights should be computed in such a way that region boundaries are respected. More precisely: (1) texton histograms should be collected from pixels in a window residing exclusively in one and only one region. If instead an isotropic window is used, pixels near a texture boundary will have a histogram computed from textons in both regions, thus "polluting" the histogram. (2) Intervening contours should only be considered at region boundaries. Any responses to the filters inside a region are either caused by texture or are simply mistakes. However, these two criteria mean that we need a segmentation of the image, which is exactly the reason why we compute the weights in the first place! This chicken-and-egg problem suggests an iterative framework for computing the segmentation. First, use the local estimation of the weights to compute a segmentation. This segmentation is done so that no region boundaries are missed, i.e. it is an over-segmentation. Next, use this initial segmentation to update the weights. Since the initial segmentation does not miss any region boundaries, we can coarsen the graph by merging all the nodes inside a region into one super-node. We can then use these super-nodes to define a much simpler segmentation problem. Of course, we could continue this iteration several times; however, we elect to stop after 1 iteration.

The procedure consists of the following 4 steps:

1. Compute an initial segmentation from the locally estimated weight matrix.
2. Update the weights using the initial segmentation.
3. Coarsen the graph with the updated weights to reduce the segmentation to a much simpler problem.
4. Compute a final segmentation using the coarsened graph.

5.1. Computing the Initial Segmentation

Computing a segmentation of the image amounts to computing the eigenvectors of the generalized
eigensystem: (D − W)v = λDv (Eq. (3)). The eigenvectors can be thought of as a transformation of the image into a new feature vector space. In other words, each pixel in the original image is now represented by a vector with the components coming from the corresponding pixel across the different eigenvectors. Finding a partition of the image is done by finding the clusters in this eigenvector representation. This is a much simpler problem because the eigenvectors have essentially put regions of coherent descriptors, according to our cues of texture and contour, into very tight clusters. Simple techniques such as K-means can do a very good job in finding these clusters. The following procedure is taken:

1. Compute the eigenvectors corresponding to the second smallest through the twelfth smallest eigenvalues of the generalized eigensystem ((D − W)v = λDv).⁶ Call these 11 eigenvectors v_i, i = 2, . . . , 12. The corresponding eigenvalues are λ_i, i = 2, . . . , 12.
2. Weight⁷ the eigenvectors according to the eigenvalues: v̂_i = (1/√λ_i) v_i, i = 2, . . . , 12. The eigenvalues indicate the "goodness" of the corresponding eigenvectors. Now each pixel is transformed to an 11-dimensional vector represented by the weighted eigenvectors.
3. Perform vector quantization on the 11 eigenvectors using K-means. Start with K* = 30 centers. Let the corresponding RMS error for the quantization be e*. Greedily delete one center at a time such that the increase in quantization error is the smallest. Continue this process until we arrive at K centers, where the error e is just greater than 1.1 × e*.

This partitioning strategy provides us with an initial segmentation of the image. This is usually an over-segmentation. The main goal here is simply to provide an initial guess for us to modify the weights. Call this initial segmentation of the image S0, and let the number of segments be N0. A typical number for N0 is 10–100.

It should be noted that this strategy of using multiple eigenvectors to provide an initial over-segmentation is merely one of a set of possibilities. Alternatives include recursive splitting using the second eigenvector, or first converting the eigenvectors into binary-valued vectors and using those simultaneously, as in Shi and Malik (2000). Yet another hybrid strategy is suggested in Weiss (1999). We hope that improved theoretical insight into spectral graph partitioning will give us a better way to make this presently somewhat ad hoc choice.

5.2. Updating Weights

The initial segmentation S0 found in the previous step provides a good approximation with which to modify the weights, as discussed earlier. With S0, we modify the weight matrix as follows:

– To compute the texton histogram for a pixel in R_k, textons are collected only from the intersection of R_k and the isotropic window of size determined by the scale, α.
– p_B is set to zero for pixels that are not on the region boundaries of S0.

The modified weight matrix is an improvement over the original local estimation of the weights.

5.3. Coarsening the Graph

By hypothesis, since S0 is an over-segmentation of the image, no boundaries are missed. We do not need to recompute a segmentation for the original problem of N pixels. We can coarsen the graph, where each node of the new graph is a segment in S0. The weight between two nodes in this new graph is computed as follows:

    Ŵ_kl = Σ_{i∈R_k} Σ_{j∈R_l} W_ij    (7)
Figure 12. p_B is allowed to be non-zero only at the pixels marked.
Figure 13. Initial segmentation of the image used for coarsening the graph and computing final segmentation.
Figure 14. Segmentation of images with animals.
Figure 15. Segmentation of images with people.
where R_k and R_l indicate segments in S0 (k and l ∈ {1, . . . , N0}); Ŵ is the weight matrix of the coarsened graph and W is the weight matrix of the original graph. This coarsening strategy is just an instance of graph contraction (Chung, 1997). We have thus reduced the original segmentation problem with an N × N weight matrix to a much simpler and faster segmentation problem with an N0 × N0 matrix, without losing performance.

5.4. Computing the Final Segmentation

After coarsening the graph, we have turned the segmentation problem into a very simple graph partitioning problem of very small size. We compute the final segmentation using the following procedure:

1. Compute the second smallest eigenvector for the generalized eigensystem using Ŵ.
2. Threshold the eigenvector to produce a bipartitioning of the image. 30 different values, uniformly spaced within the range of the eigenvector, are tried as the threshold. The one producing a partition which minimizes the normalized cut value is chosen. The corresponding partition is the best way to segment the image into two regions.
3. Recursively repeat steps 1 and 2 for each of the partitions until the normalized cut value is larger than 0.1.
Figure 16. Segmentation of images of natural and man-made scenes.
5.5. Segmentation in Windows

The above procedure performs very well on images with a small number of groups. However, in complicated images, smaller regions can be missed. This problem is intrinsic to global segmentation techniques, where the goal is to find a big-picture interpretation of the image. It can be dealt with very easily by performing the segmentation in windows.

Consider the case of breaking up the image into quadrants. Define Q_i to be the set of pixels in the ith quadrant, so that Q_i ∩ Q_j = ∅ and ∪_{i=1}^{4} Q_i = Image. Extend each quadrant by including all the pixels which are less than a distance r from any pixel in Q_i, with r being the maximum texture scale, α(i), over the whole image. Call these enlarged windows Q̂_i. Note that these windows now overlap each other.

Corresponding to each Q̂_i, a weight matrix Ŵ_i is defined by pulling out from the original weight matrix W the edges whose end-points are nodes in Q̂_i. For each Ŵ_i, an initial segmentation Ŝ_i^0 is obtained according to the procedure in (§5.1). The weights are updated as in (§5.2). The extension of each quadrant makes sure that the arbitrary boundaries created by the windowing do not affect this procedure:

Texton histogram upgrade: For each pixel in Q_i, the largest possible histogram window (a (2α + 1)² box) is entirely contained in Q̂_i by virtue of the extension.
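Assuming the quadrant construction just described, the window bookkeeping can be sketched as follows. `quadrant_windows` is a hypothetical helper, not the paper's code; in practice the margin r would be the maximum texture scale over the image.

```python
import numpy as np

def quadrant_windows(H, W, r):
    """Four disjoint quadrants Q_i and their overlapping extensions
    Q_hat_i, grown by the margin r and clipped to the image. Windows
    are returned as (row-slice, col-slice) pairs."""
    h2, w2 = H // 2, W // 2
    quads, extended = [], []
    for r0, r1, c0, c1 in [(0, h2, 0, w2), (0, h2, w2, W),
                           (h2, H, 0, w2), (h2, H, w2, W)]:
        quads.append((slice(r0, r1), slice(c0, c1)))
        extended.append((slice(max(0, r0 - r), min(H, r1 + r)),
                         slice(max(0, c0 - r), min(W, c1 + r))))
    return quads, extended

quads, ext = quadrant_windows(128, 192, r=12)

cover = np.zeros((128, 192), dtype=int)
for q in quads:
    cover[q] += 1       # the quadrants tile the image exactly once

ext_cover = np.zeros((128, 192), dtype=int)
for e in ext:
    ext_cover[e] += 1   # the extensions overlap near quadrant borders
```

With r at least as large as the texture scale, any histogram window centered on a pixel of Q_i stays inside Q̂_i, except where it would leave the image itself.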
Figure 17. Segmentation of paintings.
This means the texton histograms are computed from all the relevant pixels.

Contour upgrade: The boundaries in Q_i are a proper subset of the boundaries in Q̂_i. So, we can set the value of p_B at a pixel in Q_i to zero if it does not lie on a region boundary in Q̂_i. This enables the correct computation of W_ij^IC. Two example contour update maps are shown in Fig. 12.

Initial segmentations can be computed for each Q̂_i to give Ŝ_i^0. They are restricted to Q_i to produce S_i^0. These segmentations are merged to form an initial segmentation S0 = ∪_{i=1}^{4} S_i^0. At this stage, fake boundaries from the windowing effect can occur; two examples are shown in Fig. 13. The graph is then coarsened and the final segmentation is computed as in (§5.3) and (§5.4).

6. Results

We have run our algorithm on a variety of natural images. Figures 14–17 show typical segmentation results. In all cases, the regions are cleanly separated from each other using the combined texture and contour cues. Notice that for all these images a single set of parameters is used. Color is not used in any of these examples; it can readily be included to further improve the performance of our algorithm.⁸ Figure 14 shows results for animal images. Results for images containing people are shown in Fig. 15, while natural and
man-made scenes appear in Fig. 16. Segmentation results for paintings are shown in Fig. 17. A set of more than 1000 images from the commercially available Corel Stock Photos database has been segmented using our algorithm.⁹

Evaluating the results against ground truth—What is the correct segmentation of the image?—is a challenging problem. This is because there may not be a single correct segmentation, and segmentations can be at varying levels of granularity. We do not address this problem here; a start has been made in recent work in our group (Martin et al., 2000).

Computing times for a C++ implementation of the entire system are under two minutes for images of size 108 × 176 pixels on a 750 MHz Pentium III machine. There is some variability from one image to another because the eigensolver can take more or less time to converge depending on the image.

7. Conclusion

In this paper we have developed a general algorithm for partitioning grayscale images into disjoint regions of coherent brightness and texture. The novel contribution of the work is in cue integration for image segmentation—the cues of contour and texture differences are exploited simultaneously. We regard the experimental results as promising and hope that the paper will spark renewed research activity in image segmentation, one of the central problems of computer vision.

Acknowledgments

The authors would like to thank the Berkeley vision group, especially Chad Carson, Alyosha Efros, David Forsyth, and Yair Weiss, for useful discussions during the development of the algorithm. We thank Doron Tal for implementing the algorithm in C++. This research was supported by (ARO) DAAH04-96-1-0341, the Digital Library Grant IRI-9411334, NSF Graduate Fellowships to SB and JS, and a Berkeley Fellowship to TL.

Notes

1. For more discussions and variations of the K-means algorithm, the reader is referred to Duda and Hart (1973) and Gersho and Gray (1992).
2. It is straightforward to develop a method for merging translated versions of the same basic texton, though we have not found it necessary. Merging in this manner decreases the number of channels needed but necessitates the use of phase-shift information.
3. This is set to 3% of the image dimension in our experiments. This is tied to the intermediate scale of the filters in the filter set.
4. This is set to 10% of the image dimension in our experiments.
5. Finding the true optimal partition is an NP-hard problem.
6. The eigenvector corresponding to the smallest eigenvalue is constant, thus useless.
7. Since normalized cut can be interpreted as a spring-mass system (Shi and Malik, 2000), this normalization comes from the equipartition theorem in classical statistical mechanics, which states that if a system is in equilibrium, then it has equal energy in each mode (Belongie and Malik, 1998).
8. When color information is available, the similarity W_ij becomes a product of 3 terms: W_ij = W_ij^IC × W_ij^TX × W_ij^COLOR. Color similarity, W_ij^COLOR, is computed using χ² differences over color histograms, similar to texture similarity measured using texton histograms. Moreover, color can be clustered into "colorons", analogous to textons.
9. These results are available at the following web page: http://www.cs.berkeley.edu/projects/vision/Grouping/overview.html

References

Belongie, S., Carson, C., Greenspan, H., and Malik, J. 1998. Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In Proc. 6th Int. Conf. Computer Vision, Bombay, India, pp. 675–682.
Belongie, S. and Malik, J. 1998. Finding boundaries in natural images: A new method using point descriptors and area completion. In Proc. 5th Euro. Conf. Computer Vision, Freiburg, Germany, pp. 751–766.
Binford, T. 1981. Inferring surfaces from images. Artificial Intelligence, 17(1–3):205–244.
Canny, J. 1986. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell., 8(6):679–698.
Chung, F. 1997. Spectral Graph Theory. AMS: Providence, RI.
DeValois, R. and DeValois, K. 1988. Spatial Vision. Oxford University Press: New York, NY.
Duda, R. and Hart, P. 1973. Pattern Classification and Scene Analysis. John Wiley & Sons: New York, NY.
Elder, J. and Zucker, S. 1996. Computing contour closures. In Proc. Euro. Conf. Computer Vision, Vol. I, Cambridge, England, pp. 399–412.
Fogel, I. and Sagi, D. 1989. Gabor filters as texture discriminator. Biological Cybernetics, 61:103–113.
Geman, S. and Geman, D. 1984. Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6:721–741.
Gersho, A. and Gray, R. 1992. Vector Quantization and Signal Compression. Kluwer Academic Publishers: Boston, MA.
Heeger, D.J. and Bergen, J.R. 1995. Pyramid-based texture analysis/synthesis. In Proceedings of SIGGRAPH '95, pp. 229–238.
Jacobs, D. 1996. Robust and efficient detection of salient convex groups. IEEE Trans. Pattern Anal. Mach. Intell., 18(1):23–37.
Jones, D. and Malik, J. 1992. Computational framework for determining stereo correspondence from a set of linear spatial filters. Image and Vision Computing, 10(10):699–708.
Julesz, B. 1981. Textons, the elements of texture perception, and their interactions. Nature, 290(5802):91–97.
Knutsson, H. and Granlund, G. 1983. Texture analysis using two-dimensional quadrature filters. In Workshop on Computer Architecture for Pattern Analysis and Image Database Management, pp. 206–213.
Koenderink, J. and van Doorn, A. 1987. Representation of local geometry in the visual system. Biological Cybernetics, 55(6):367–375.
Koenderink, J. and van Doorn, A. 1988. Operational significance of receptive field assemblies. Biological Cybernetics, 58:163–171.
Leung, T. and Malik, J. 1998. Contour continuity in region-based image segmentation. In Proc. Euro. Conf. Computer Vision, Vol. 1, H. Burkhardt and B. Neumann (Eds.), Freiburg, Germany, pp. 544–559.
Leung, T. and Malik, J. 1999. Recognizing surfaces using three-dimensional textons. In Proc. Int. Conf. Computer Vision, Corfu, Greece, pp. 1010–1017.
Malik, J., Belongie, S., Shi, J., and Leung, T. 1999. Textons, contours and regions: Cue integration in image segmentation. In Proc. IEEE Intl. Conf. Computer Vision, Vol. 2, Corfu, Greece, pp. 918–925.
Malik, J. and Perona, P. 1990. Preattentive texture discrimination with early vision mechanisms. J. Optical Society of America, 7(2):923–932.
Malik, J. and Perona, P. 1992. Finding boundaries in images. In Neural Networks for Perception, Vol. 1, H. Wechsler (Ed.), Academic Press, pp. 315–344.
Martin, D., Fowlkes, C., Tal, D., and Malik, J. 2000. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Technical Report UCB CSD-01-1133, University of California at Berkeley. http://http.cs.berkeley.edu/projects/vision/Grouping/overview.html.
McLean, G. 1993. Vector quantization for texture classification. IEEE Transactions on Systems, Man, and Cybernetics, 23(3):637–649.
Montanari, U. 1971. On the optimal detection of curves in noisy pictures. Comm. Ass. Comput., 14:335–345.
Morrone, M. and Burr, D. 1988. Feature detection in human vision: A phase dependent energy model. Proc. R. Soc. Lond. B, 235:221–245.
Morrone, M. and Owens, R. 1987. Feature detection from local energy. Pattern Recognition Letters, 6:303–313.
Mumford, D. and Shah, J. 1989. Optimal approximations by piecewise smooth functions, and associated variational problems. Comm. Pure Math., 42:577–684.
Parent, P. and Zucker, S. 1989. Trace inference, curvature consistency, and curve detection. IEEE Trans. Pattern Anal. Mach. Intell., 11(8):823–839.
Perona, P. and Malik, J. 1990. Detecting and localizing edges composed of steps, peaks and roofs. In Proc. 3rd Int. Conf. Computer Vision, Osaka, Japan, pp. 52–57.
Puzicha, J., Hofmann, T., and Buhmann, J. 1997. Non-parametric similarity measures for unsupervised texture segmentation and image retrieval. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 267–272.
Raghu, P., Poongodi, R., and Yegnanarayana, B. 1997. Unsupervised texture classification using vector quantization and deterministic relaxation neural network. IEEE Transactions on Image Processing, 6(10):1376–1387.
Sha'ashua, A. and Ullman, S. 1988. Structural saliency: The detection of globally salient structures using a locally connected network. In Proc. 2nd Int. Conf. Computer Vision, Tampa, FL, USA, pp. 321–327.
Shi, J. and Malik, J. 1997. Normalized cuts and image segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 731–737.
Shi, J. and Malik, J. 2000. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905.
Weiss, Y. 1999. Segmentation using eigenvectors: A unifying view. In Proc. IEEE Intl. Conf. Computer Vision, Vol. 2, Corfu, Greece, pp. 975–982.
Wertheimer, M. 1938. Laws of organization in perceptual forms (partial translation). In A Sourcebook of Gestalt Psychology, W. Ellis (Ed.), Harcourt Brace and Company, pp. 71–88.
Williams, L. and Jacobs, D. 1995. Stochastic completion fields: A neural model of illusory contour shape and salience. In Proc. 5th Int. Conf. Computer Vision, Cambridge, MA, pp. 408–415.
Young, R.A. 1985. The Gaussian derivative theory of spatial vision: Analysis of cortical cell receptive field line-weighting profiles. Technical Report GMR-4920, General Motors Research.