Visual Pattern Recognition
1 author:
Michael J. Tarr
Carnegie Mellon University
Perceiving the visual world around us is one of the most basic acts of everyday experience. Although pattern recognition may seem effortless, it is actually a complex problem--so much so that the visual areas responsible for this process occupy up to one-half of our cortex. Fundamental to our perception is the transformation of the light array that falls on our retinae into coherent surfaces and objects. How this is done is still a matter of some debate, but results from psychophysical, neuropsychological, and physiological studies point towards a remarkably adaptive system that supports a wide range of recognition tasks.

THE FLEXIBILITY OF HUMAN RECOGNITION

One of the hallmarks of the human pattern recognition system is its extreme flexibility. As observers we are able to identify objects under a wide array of conditions that confound even the most powerful computer vision systems. For example, we can recognize objects at many different categorical levels. Often, however, it is assumed that objects are first identified at the entry level (Jolicoeur, Gluck, & Kosslyn, 1984)--defined as the name which is generated or matched most rapidly to a given object, e.g., “apple” or “bird” (although less typical instances may be identified at a more specific level, e.g., “penguins”).

While entry-level recognition is certainly an important element of everyday recognition, it is not the only level at which objects are recognized. We often identify objects at a more specific level, sometimes referred to as the subordinate level, e.g., a “McIntosh apple” or a “white-breasted nuthatch.” Such finer discriminations require additional perceptual analysis and thus typically take longer than entry-level judgments. Beyond subordinate-level recognition, we also recognize objects at the individual or exemplar level, e.g., “the McIntosh apple I brought for lunch.” Making such judgments requires specific information about a given individual object and consequently even greater perceptual analysis.

The fact that we are capable of performing any of these tasks with remarkable precision indicates that visual recognition is best thought of as a continuum ranging from rather coarse to incredibly fine perceptual discriminations (Tarr & Bülthoff, 1995). However, as pointed out by Marr and Nishihara (1978), there is a tradeoff between the information that captures the more general and less variable properties of objects and the information that captures the finer and more variable details of objects. Thus it is not clear that a single recognition system can support categorization at many different levels.

PATTERN PERCEPTION

Before objects can be identified at any level, the contours and surfaces defining them must be grouped into coherent wholes or patterns. Remember that what arrives at our eyes is an undifferentiated array of light intensities, but what we perceive are surfaces and objects. At the most elementary level it is clear that our visual system performs a feature analysis at every location in the scene. What this means is that the optic array is transformed into a description of local visual properties, for example, the orientation of edges or the color over restricted regions. This recoding provides important information about what is happening in the scene in terms of visual properties that we “care” about, but does not tell us how to put such features together to form more complex percepts. Thus, it is typically assumed that, following feature detection, principles of perceptual organization are used to combine local features into more global structures, for example, extended contours or textured 3D surfaces. Principles such as similarity (similar features are grouped together) or good continuation (features that form a straight line or a smooth curve are grouped together) were originally elucidated by the Gestalt school of psychology during the early 1900s and, remarkably, they are still considered fundamental today.

Beyond such simple principles, there are many other processes that appear to contribute to the perception of complex patterns (for a good overview of many aspects of pattern perception, see the readings in Rock, 1990). Two of the best known are structure-from-motion and shape-from-shading. In the former case, 3D surfaces can be perceived because local 2D motions can be integrated according to the principle that they must all arise from a single moving rigid object whose surface orientations only change gradually. Likewise, by attending to the change in shading (or texture for that matter) over a surface and again assuming only gradual shifts in orientation, we can perceive the 3D shape of an object. Even these principles, however, are insufficient to specify completely the complex nature of a typical scene. What is ultimately necessary is that we separate each object from both other objects and the
background--a process known as figure-ground segregation. Presumably we rely on cues such as discontinuities in color, texture, or shape, but the precise mechanisms for accomplishing figure-ground segregation are still poorly understood--indeed, this is one of the reasons why computer vision systems are so bad at object recognition.

RECOGNIZING OBJECTS IN A CHANGING WORLD

Compounding the already difficult task of segmenting individual objects out of a complex scene is the variability we encounter in viewing conditions at different moments in time. The recognition system must contend with images of objects that vary with changes in almost any viewing parameter, including occlusion, illumination, orientation, 3D viewpoint, position, size, or configuration (Figure 1). Almost any source of variability may affect recognition performance, with recent evidence suggesting that the degree to which a change impairs recognition depends on the categorical level of the recognition judgment. For example, increasing the similarity between the actual target object and other potential target objects (as in increasingly subordinate-level tasks) typically increases recognition costs across changes in viewpoint (Tarr, 1995). As discussed in the following section, the bulk of object recognition research has focused on variability due to changes in viewpoint--presumably because rotating an object in depth produces such dramatic changes in the image.

Figure 1. Examples of some of the variability the visual recognition system must overcome. From left to right we can still recognize the fan despite: partial occlusion, a change in illumination, a change in viewpoint, and a change in configuration.

CURRENT THEORIES OF OBJECT RECOGNITION

How the visual system compensates for variation in the image forms the core of almost all current theories of object recognition. As a starting point most theories assume either relatively image-invariant or relatively image-dependent representations. Viewpoint is often taken to be the diagnostic case and theories typically predict either small and discrete performance costs across changes in viewpoint (Biederman, 1987) or large and continuous performance costs across changes in viewpoint (Tarr, 1995). One might think that, because we are able to recognize known objects quite well from almost any viewing position, object representations must be viewpoint invariant. In fact, since we are already familiar with real-world objects from many different viewpoints, it is just as likely that we have learned multiple viewpoint-specific representations for each known object or class.

To investigate these alternatives, Tarr and Pinker (1989) taught observers to name novel objects from a single orientation. When they tested generalization to new picture-plane orientations, Tarr and Pinker found that observers were fastest at the familiar orientation and progressively slower at unfamiliar orientations further and further from the trained orientation. With practice, however, observers became equally fast at all known orientations. Tarr and Pinker hypothesized that this learning was analogous to the apparent viewpoint independence exhibited for known objects--invariance obtained by virtue of multiple views. This hypothesis was tested by introducing additional new orientations for the now-familiar objects. While observers continued to show equivalent performance for all familiar orientations, they again took longer to recognize the objects in unfamiliar orientations--now, however, with performance dependent on the nearest familiar orientation. Similar results have also been obtained for novel 3D objects rotated in depth (Tarr, 1995).

Converging evidence for multiple image-dependent representations--often referred to as view-based models--comes from physiological research. For example, Logothetis, Pauls, and Poggio (1995) trained monkeys to recognize novel objects from several different 3D viewpoints. With practice the monkeys, like humans, became equally good at recognizing the objects from any of the training viewpoints. When Logothetis et al. recorded the responses of cells in the inferior temporal cortex (IT) of the monkeys, they found that many cells responded selectively to a previously novel object and, crucially, maximally to a single viewpoint that had been shown during training. As with recognition performance in humans, there was a gradual decrease in a given cell’s response as the preferred object was rotated away from the familiar viewpoint. Thus, an ensemble of neurons, each tuned to a different viewpoint, may represent a 3D object.

While such results are intriguing, many aspects of view-based models are underspecified. For instance, there is as yet no clear definition of what features are used to represent each view of an object. Although many theorists have used simplified features (such as vertices specified in linear image coordinates), they are quick to point out that view-based models are unlikely to rely on such features. In particular, features based on
spatial coordinates such as pixels are highly unstable; rather it is presumed that higher-order features such as surface patches, edge features, or bounding contours are used, albeit in a viewpoint-specific manner.

The best-known alternatives to view-based models are structural-description models that typically assume image-invariant representations (Marr & Nishihara, 1978; Biederman, 1987). The fundamental assumption of these models is that objects are represented in terms of features that are stable over changes in the image, for example, parts described as 3D volumes. A second assumption is that configurations of parts are described relative to one another rather than relative to the observer or the world. Marr and Nishihara (1978) assumed that observers could use the major axes of an object to recover the shape of almost any part--because the parts were 3D volumes described in a viewpoint-independent manner, once abstracted away from the image, a description of a given object was identical regardless of the viewing position. More recently, Biederman (1987) has proposed a related scheme in which there is only a limited set (~30) of qualitatively-defined 3D volumes, e.g., “brick” or “cone.” While a restricted set of parts may make object representation more tractable in a computational sense, it limits the theory to entry-level recognition (since many variations in fine structure are mapped into a single volume). More fundamentally, both of these schemes, as well as other models based on 3D volumes, have been dogged by the question of how to recover descriptions of parts from images--at present there is no workable solution to this problem.

One element that appears to be missing from both view-based and structural-description models is how to account for the flexibility of recognition across different categorical levels. It has often been claimed that structural descriptions are best suited to entry-level categorization, while view-based models are best suited to subordinate-level or individual recognition. Such a hypothesis is not entirely satisfactory if recognition is to be thought of as a continuum of different levels of access. Where does one draw the boundary between one process and the other? Moreover, while it is known that damage to parts of the visual system can result in object agnosia--an inability to visually identify certain types of objects--there is little evidence to suggest that agnosic subjects simply lose the ability to recognize objects at either the entry level or the subordinate level (the pattern expected if one of the two systems was completely removed). Rather, agnosics seem to show a more complex pattern of sparing and loss, as if selective deficits occur according to the types of processing subsystems that are impaired in the individual (although, as reviewed below, there are a number of cases in which subjects apparently lose the ability to recognize human faces--raising the possibility of class-specific recognition systems). Thus, there is not much support for separable recognition systems for different levels of object identification. More likely is that a single system can be fine-tuned in response to perceptual experience, thereby mediating multiple categorical levels. Indeed, theorists have begun to consider the possibility of a single recognition system that can support both coarse categorical judgments and finer discriminations, adaptively selecting the most appropriate features according to the task at hand.

FACE RECOGNITION

Although the notion of a unified recognition system is appealing, there are certain phenomena that point towards specialized recognition mechanisms. One of the most notable examples is the case of face recognition, where brain-injured patients, brain imaging studies, and behavioral results all appear to indicate a face-specific recognition system. Perhaps the most compelling piece of evidence is the phenomenon of prosopagnosia--a syndrome in which brain injury to visual cortex results in a profound inability to recognize individual faces (see Farah, 1992). While it has been argued that prosopagnosic subjects are more impaired at recognizing faces relative to other objects, it is possible that face recognition occurs at a more subordinate level as compared to common object recognition. Indeed, almost all prosopagnosics have some difficulties recognizing non-face objects.

A second piece of evidence for face-specific processing is the finding from functional brain imaging that certain areas of IT are more active for face recognition as compared to common object recognition (Sergent, Ohta, & MacDonald, 1992). As mentioned, however, faces and common objects are generally recognized at different categorical levels--faces at the individual level and common objects at the class level. What happens when common objects are also recognized at a more specific level? Using functional magnetic resonance imaging (fMRI), Gauthier et al. (1997) observed that the same areas of IT found to be more active for face recognition are also more active for subordinate-level recognition of common objects as compared to entry-level recognition of the same objects. Therefore, this area of IT is more plausibly involved in finer levels of recognition, regardless of stimulus class, rather than in face recognition per se.

Figure 2. Examples of Greeble objects used to study perceptual expertise. Four Greebles from one family are shown, the top two being of one gender and the bottom two being of the other gender.

Finally, there are many studies that have reported “face-specific” behavioral effects. One of the most interesting is that of “holistic” processing for faces. Tanaka and Farah (see summary in Farah, 1992) found that observers were poorer at recognizing part of a trained face, e.g., “Bob’s nose,” if other parts of the face, e.g., the eyes, were transformed from the original configuration. This configural sensitivity is surprising in that the recognition of an individual part would seem to be independent of other features. In contrast, identifying individual parts of trained houses did not produce this effect, suggesting face specificity. On the other hand, observers are almost always perceptual experts at individual face recognition, but rarely so for other classes of objects (exceptions being birdwatchers and the like). To test whether perceptual expertise rather than the stimulus class produces configural sensitivity, Gauthier and Tarr (1997) created a novel class of objects--“Greebles” (Figure 2). Observers unfamiliar with Greebles did not show configural sensitivity. Greeble experts (created through 10 hours of training), however, showed configural sensitivity in identifying individual Greeble parts. Thus, factors such as the degree of perceptual expertise, rather than the stimulus class, are apparently responsible for what was previously thought to be face-specific processing.

BIBLIOGRAPHY

Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115-147.
Farah, M. J. (1992). Is an object an object an object? Cognitive and neuropsychological investigations of domain-specificity in visual object recognition. Current Directions in Psychological Science, 1, 164-169.
Gauthier, I., Anderson, A. W., Tarr, M. J., Skudlarski, P., & Gore, J. C. (1997). Levels of categorization in visual object studied with functional MRI. Current Biology, 7, 645-651.
Gauthier, I., & Tarr, M. J. (1997). Becoming a “Greeble” expert: Exploring the face recognition mechanism. Vision Research, 37, 1673-1682.
Jolicoeur, P., Gluck, M., & Kosslyn, S. M. (1984). Pictures and names: Making the connection. Cognitive Psychology, 16, 243-275.
Logothetis, N. K., Pauls, J., & Poggio, T. (1995). Shape representation in the inferior temporal cortex of monkeys. Current Biology, 5, 552-563.
Marr, D., & Nishihara, H. K. (1978). Representation and recognition of the spatial organization of three-dimensional shapes. Proc R Soc of Lond B, 200, 269-294.
Rock, I. (Ed.). (1990). The Perceptual World. New York, NY: W. H. Freeman and Company.
Sergent, J., Ohta, S., & MacDonald, B. (1992). Functional neuroanatomy of face and object processing: A positron emission tomography study. Brain, 115, 15-36.
Tarr, M. J. (1995). Rotating objects to recognize them: A case study of the role of viewpoint dependency in the recognition of three-dimensional objects. Psychonomic Bulletin and Review, 2, 55-82.
Tarr, M. J., & Bülthoff, H. H. (1995). Is human object recognition better described by geon-structural-descriptions or by multiple-views? Journal of Experimental Psychology: Human Perception and Performance, 21, 1494-1505.
Tarr, M. J., & Pinker, S. (1989). Mental rotation and orientation-dependence in shape recognition. Cognitive Psychology, 21, 233-282.