COMPUTER VISION: FOUNDATIONS AND APPLICATIONS
STANFORD UNIVERSITY
Copyright © 2017 Compiled by Ranjay Krishna
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in com-
pliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/
LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an “as is” basis, without warranties or conditions of any kind, either
express or implied. See the License for the specific language governing permissions and limitations under
the License.
Color
Edge Detection
Feature Descriptors
Image Resizing
Clustering
Motion
Tracking
Bibliography
List of Figures

3. Mixing two lights produces colors that lie along a straight line in color space. Mixing three lights produces colors that lie within the triangle they define in color space.
4. Representation of RGB primaries and corresponding matching functions. The matching functions are the amounts of the primaries needed to match the monochromatic test color at the wavelength shown on the horizontal scale. Source: https://en.wikipedia.org/wiki/CIE_1931_color_space
5. Source: https://en.wikipedia.org/wiki/CIE_1931_color_space
6. General source: https://en.wikipedia.org/wiki/HSL_and_HSV
7. Example of two photos, one unbalanced and one with incorrect white balancing. Source: http://www.cambridgeincolour.com/tutorials/white-balance.htm
22. The corner of a curve appears at two scales. Note that the circular window on the right curve captures the entire corner, while the same-sized window on the left curve does not. Instead, we must choose a much larger circular window on the left curve to get the same information. Source: Lecture 7, slide 12.
23. Two plots of the response of f(window) as a function of window size for Images 1 and 2, where Image 2 is similar to Image 1 but scaled by 1/2. Source: Lecture 7, slide 15.
24. On the left: pyramid of Gaussians of different σ's and different image sizes. On the right: difference of adjacent Gaussians. Source: http://aishack.in/tutorials/sift-scale-invariant-feature-transform-log-approximation/
25. Given a coordinate in x-y-scale space (denoted by the black X), examine its 26 neighbors (denoted by the green circles) to determine whether the original coordinate is a local extremum. Source: Lecture 7, slide 22.
26. This figure shows the percentage of correctly matched keypoints as a function of the width of the descriptor and of the number of histogram bins. [1]
27. Here we see a visual example of keeping track of the magnitudes of the gradients for each gradient direction. Source: Lecture 7, slide 60.
28. HoG applied to a bicycle. Source: Lecture 7, slide 65.
95. In the aperture problem, the line appears to have moved to the right when seen only in the context of the frame, but the true motion of the line was down and to the right. The aperture problem is a result of optical flow being unable to represent motion along an edge, an issue that can lead to other errors in motion estimation as well.
96. Conditions for a solvable matrix A^T A may be interpreted as different edge regions depending on the relation between λ1 and λ2. Corner regions produce more optimal conditions.
97. Example of regions with large λ1 and small λ2 (left), small λ1 and small λ2 (center, low-texture region), and large λ1 and large λ2 (right, high-texture region).
Definition
Two definitions of computer vision Computer vision can be defined as
a scientific field that extracts information out of digital images. The type of
information gained from an image can vary from identification, space
measurements for navigation, or augmented reality applications.
Another way to define computer vision is through its applications.
Computer vision is building algorithms that can understand the content
of images and use it for other applications. We will see more detail in
the last section on the different domains where computer vision is
applied.
As a result, we have a 50-year-old scientific field which is still far from being solved.
An interdisciplinary field
Computer vision brings together a large set of disciplines. Neuro-
science can help computer vision by first understanding human
vision, as we will see later on. Computer vision can be seen as a part
of computer science, and algorithm theory or machine learning are
essential for developing computer vision algorithms.
We will show in this class how all the fields in figure 13 are con-
nected, and how computer vision draws inspiration and techniques
from them.
A hard problem
Computer vision has not been solved in 50 years, and is still a very
hard problem. It’s something that we humans do unconsciously but
that is genuinely hard for computers.
Poetry harder than chess The IBM supercomputer Deep Blue defeated
for the first time the world chess champion Garry Kasparov in 1997.
Today we still struggle to create algorithms that output well-formed
sentences, let alone poems. The gap between these two domains
shows that what humans call intelligence is often not a good criterion
to assess the difficulty of a computer task. Deep Blue won through
brute force search among millions of possibilities and was not more
intelligent than Kasparov.
Definition of vision
Be it a computer or an animal, vision comes down to two compo-
nents.
First, a sensing device captures as much detail from an image
as possible. The eye will capture light coming through the iris and
project it onto the retina, where specialized cells will transmit informa-
tion to the brain through neurons. A camera captures images in a
similar way and transmits pixels to the computer. In this respect, cameras
are better than humans, as they can see infrared, see farther away, or
see with more precision.
Second, the interpreting device has to process the information
and extract meaning from it. The human brain solves this in multiple
steps in different regions of the brain. Computer vision still lags
behind human performance in this domain.
There has been a myth that the brain cannot understand itself. It is
compared to a man trying to lift himself by his own bootstraps. We feel
that is nonsense. The brain can be studied just as the kidney can.
It takes humans only around 150 ms to recognize an animal in a normal nature scene
(Simon Thorpe, Denis Fize, and Catherine Marlot. Speed of processing
in the human visual system. Nature, 381(6582):520, 1996).
Figure 2 shows how the brain responses to images of animals and
non-animals diverge after around 150 ms.
Context Humans use context all the time to infer clues about images.
Previous knowledge is one of the most difficult tools to incorporate
into computer vision. Humans use context to know where to focus
on an image, to know what to expect at certain positions. Context
also helps the brain to compensate for colors in shadows.
However, context can be used to fool the human brain.
Special effects Shape and motion capture are new techniques used
in movies like Avatar to animate digital characters by recording the
movements played by a human actor. In order to do that, we have to
Face detection Face detection has been used for multiple years in
cameras to take better pictures and focus on the faces. Smile detec-
tion can allow a camera to take pictures automatically when the
subject is smiling. Face recognition is more difficult than face de-
tection, but with the scale of today’s data, companies like Facebook
are able to get very good performance. Finally, we can also use com-
puter vision for biometrics, using unique iris pattern recognition or
fingerprints.
Augmented Reality AR is also a very hot field right now, and multi-
ple companies are competing to provide the best mobile AR platform.
Apple released ARKit in June, and there are already impressive applica-
tions (check out the different apps).
Physics of Color
What is color?
Color is the result of interaction between physical light in the envi-
ronment and our visual system. It is a psychological property of our
visual experiences when we look at objects and lights, not a physical
property of those objects or lights (Stephen E. Palmer. Vision Science:
Photons to Phenomenology. MIT Press, 1999).
Color and light
White light is composed of almost equal energy in all wavelengths of
the visible spectrum.
Electromagnetic Spectrum
Light is made up of waves of different wavelengths. The visible spec-
trum of light ranges from 400 nm to 700 nm, and humans are most
sensitive to light with wavelengths in the middle of this spectrum.
Humans see this range as visible light in part because the Sun, due to
its temperature, emits more light near these wavelengths than any
other color.
Visible light
Planck's Law for blackbody radiation estimates the wavelengths
of electromagnetic radiation emitted by a star, based on its surface
temperature. For instance, since the surface of the Sun is around
5800 K, the peak of the Sun's emitted light lies in the visible region.
Color Matching
Since we are interested in designing systems that provide a consistent
visual experience across viewers, it is helpful to understand the
minimal colors that can be combined to create the experience of
any perceivable color. An experiment from Wandell’s Foundations
of Vision (Sinauer Assoc., 1995) demonstrates that most people
report being able to recreate the color of a given test light by tuning
three experimental lights of differing colors. The only condition
is that each of the three lights must be a primary color. Moreover, the
experiment showed that for the same test light and primaries, the
majority of people select similar weights, though color blind people
are an exception. Finally, this experiment validates the trichromatic
theory of color - the proposition that three numbers are sufficient for
encoding color - which dates from Thomas Young’s writings in the
1700s.
Color Spaces
Definition
Color space, also known as the color model (or color system), is an
abstract mathematical model which describes the range of colors as
tuples of numbers, typically as 3 or 4 values or color components (e.g.
RGB). A color space may be arbitrary or structured mathematically.
Most color models map to an absolute and globally understood
system of color interpretation.
• RGB Space
Figure 4: Representation of RGB primaries and corresponding matching functions. The matching functions are the amounts of the primaries needed to match the monochromatic test color at the wavelength shown on the horizontal scale. Source: https://en.wikipedia.org/wiki/CIE_1931_color_space
White Balancing
Definition
1. The sensors in cameras or film are different from those in our eyes.

2. The color of the light source in the scene usually differs from a standard "white" illuminant.

3. The viewing conditions when the image was taken are usually
different from the image viewing conditions.
Von Kries’ method for white balancing was to scale each channel by a
"gain factor" to match the appearance of a gray neutral object.
In practice, the best way to achieve this is the Gray Card Method:
hold up a neutral (gray or white) card and determine the values of each
channel. If we find that the card has RGB values $r_w, g_w, b_w$, then we
scale each channel of the image by $\frac{1}{r_w}, \frac{1}{g_w}, \frac{1}{b_w}$.
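As a concrete illustration, here is a minimal NumPy sketch of the gray-card correction just described (the function name and the example card values are ours, not from the notes); it assumes a float image with channel values in [0, 1]:

```python
import numpy as np

def gray_card_white_balance(image, card_rgb):
    """Von Kries-style white balancing: scale each channel so that the
    measured gray-card color becomes neutral.

    image:    H x W x 3 float array with values in [0, 1]
    card_rgb: (r_w, g_w, b_w) values measured on the gray card
    """
    gains = 1.0 / np.asarray(card_rgb, dtype=float)   # per-channel gain factors
    balanced = image * gains                          # broadcasts over H x W
    return np.clip(balanced, 0.0, 1.0)

# Example: an image with a color cast; the card looked (0.8, 1.0, 0.7) instead of gray.
img = np.random.rand(4, 4, 3)
print(gray_card_white_balance(img, card_rgb=(0.8, 1.0, 0.7)).shape)   # (4, 4, 3)
```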
Other Methods
Without Gray Cards, we need to guess which pixels correspond to
white objects. Several methods attempt to achieve this, including
statistical and Machine Learning models (which are beyond the scope
of this class)
Gamut Mapping The Gamut of an image is the set of all pixel colors
displayed in an image (in mathematical terms, this is a "convex hull"
and a subset of all possible color combinations). We can then apply a
transformation to the image that maps the gamut of the image to the
gamut of a "standard" image under white light.
Vectors

A column vector $v \in \mathbb{R}^{n \times 1}$:

$$v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

A row vector $v^T \in \mathbb{R}^{1 \times n}$, where $v^T = \begin{bmatrix} v_1 & v_2 & \dots & v_n \end{bmatrix}$. $T$ denotes
the transpose operation, which flips a matrix over its diagonal, switch-
ing the row and column indices of the matrix.

As a note, CS 131 will use column vectors as the default. You can
transpose vectors in Python: for a NumPy vector v, use v.T.
Uses of Vectors There are two main uses of vectors. Firstly, vectors
can represent an offset in 2 or 3 dimensional space. In this case, a
point is a vector from the origin. Secondly, vectors can represent data
(such as pixels, gradients at an image keypoint, etc.). In this use case,
the vectors do not have a geometric interpretation but calculations
like "distance" can still have value.
Matrices
A matrix $A \in \mathbb{R}^{m \times n}$ is an array of numbers with $m$ rows and $n$
columns:

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \dots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \dots & a_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & a_{m3} & \dots & a_{mn} \end{bmatrix}$$
" # " #
a b 3a 3b
Scaling ∗3 =
c d 3c 3d
Norm:
$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$

More formally, a norm is any function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ that satisfies
four properties:

Example Norms:

One norm:
$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$

General p-norm:
$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$

Matrix (Frobenius) norm:
$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)}$$
Each entry of the matrix product is made by taking the dot product
of the corresponding row in the left matrix, with the corresponding
column in the right one.
Matrix multiplication is
1. Associative: ( AB)C = A( BC )
2. Distributive: ( A + B)C = AC + BC
• $\det(AB) = \det(BA)$
• $\det(A^{-1}) = \frac{1}{\det(A)}$
• $\det(A^T) = \det(A)$
" #
1 3
Trace tr (A) is the sum of diagonal elements. For A = ,
5 7
tr (A) = 1 + 7 = 8
Properties:
• tr (AB) = tr (BA)
• tr (A + B) = tr (A) + tr (B)
Special Matrices
• Identity Matrix: a square matrix with 1's along the diagonal and
0's elsewhere. Multiplying the identity by any matrix returns that matrix: $IA = A$.
Example identity matrix:
$$I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Vector

A column vector $v \in \mathbb{R}^{n \times 1}$ where

$$v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$$

Matrix

A matrix $A \in \mathbb{R}^{m \times n}$ is an array of numbers with size $m$ by $n$:

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ \vdots & & & \ddots & \vdots \\ a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn} \end{bmatrix}$$
Transformation Matrices
Matrix Inverse
$$AA^{-1} = A^{-1}A = I$$

• $(A^{-1})^{-1} = A$
• $A^{-T} \triangleq (A^T)^{-1} = (A^{-1})^T$
Pseudoinverse
Frequently in linear algebra problems, you want to solve the equa-
tion $AX = B$ for $X$. You would like to compute $A^{-1}$ and multiply
both sides to get $X = A^{-1}B$. In Python, this would be:
np.linalg.inv(A) @ B.

However, for large floating-point matrices, calculating inverses can
be very expensive and possibly inaccurate. An inverse might not
even exist for $A$. What should we do?

Luckily, we have what is known as a pseudoinverse. This other
matrix can be used to solve $AX = B$. For square, well-conditioned systems,
np.linalg.solve(A, B) is the preferred call; when no exact solution
exists, np.linalg.lstsq (which relies on the pseudoinverse) finds the
closest solution to $AX = B$.
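A short sketch of both options in NumPy (the matrix values are illustrative); np.linalg.pinv and np.linalg.lstsq are the library routines that expose the pseudoinverse-based least-squares solution:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # tall matrix: no exact inverse exists
B = np.array([1.0, 2.0, 2.5])

# Least-squares solution via the Moore-Penrose pseudoinverse
X_pinv = np.linalg.pinv(A) @ B

# Equivalent, and usually preferred numerically
X_lstsq, residuals, rank, sv = np.linalg.lstsq(A, B, rcond=None)

print(np.allclose(X_pinv, X_lstsq))   # True: both return the closest solution to AX = B
```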
Matrix Rank
Definitions
An eigenvector x of a linear transformation A is a non-zero vector that,
when A is applied to it, does not change its direction. Applying A to
the eigenvector scales the eigenvector by a scalar value λ, called an
eigenvalue.
The following equation describes the relationship between eigen-
values and eigenvectors:

$$Ax = \lambda x, \quad x \neq 0$$

Rearranging,

$$Ax = \lambda I x, \quad x \neq 0$$
$$(\lambda I - A)x = 0, \quad x \neq 0$$

Since we are looking for non-zero $x$, the matrix $(\lambda I - A)$ must be
singular; equivalently,

$$|\lambda I - A| = 0$$
Solving this equation for λ gives the eigenvalues of A, and these
can be substituted back into the original equation to find the corre-
sponding eigenvectors.
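In practice the eigenvalues and eigenvectors are computed numerically rather than by expanding the characteristic polynomial; a small sketch with NumPy (the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of `vectors` are the eigenvectors; `values` holds the eigenvalues
values, vectors = np.linalg.eig(A)

for lam, v in zip(values, vectors.T):
    print(lam, np.allclose(A @ v, lam * v))   # checks A x = lambda x for each pair
```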
Properties
• The trace of $A$ is equal to the sum of its eigenvalues:
$$\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i$$
Spectral Theory
Definitions
$$\det(A - \lambda I) = 0$$

The spectrum of $A$ is the set of its eigenvalues:
$$\sigma(A) = \{\lambda \in \mathbb{C} : \lambda I - A \text{ is singular}\}$$

The spectral radius is
$$\rho(A) = \lim_{k \to \infty} \|A^k\|^{1/k}$$

Proof sketch: for an eigenpair $(\lambda, v)$ with $v \neq 0$,
$$|\lambda|^k \|v\| = \|\lambda^k v\| = \|A^k v\|$$
By the Cauchy–Schwarz inequality ($\|Av\| \leq \|A\| \cdot \|v\|$),
$$\|A^k v\| \leq \|A^k\| \cdot \|v\|$$
Since $v \neq 0$,
$$|\lambda|^k \leq \|A^k\|,$$
so $|\lambda| \leq \|A^k\|^{1/k}$ for every $k$, consistent with
$$\rho(A) = \lim_{k \to \infty} \|A^k\|^{1/k}$$
Diagonalization
An n × n matrix A is diagonalizable if it has n linearly independent
eigenvectors.
Most square matrices are diagonalizable. In particular, a normal matrix,
i.e., one satisfying

$$A^* A = A A^*,$$

can be diagonalized as

$$A = V D V^{-1},$$

where $D$ is a diagonal matrix of eigenvalues and the columns of $V$ are the
corresponding eigenvectors.
Symmetric Matrices
If A is symmetric, then all its eigenvalues are real, and its eigenvec-
tors are orthonormal. Recalling the above diagonalization equation,
we can diagonalize A by:
$$A = V D V^T$$

Using the above relation, we can also write the following relation-
ship: given $y = V^T x$,

$$x^T A x = x^T V D V^T x = y^T D y = \sum_{i=1}^{n} \lambda_i y_i^2$$
Applications
Some applications of eigenvalues and eigenvectors include, but are
not limited to:
• PageRank
• Schrödinger's equation
Matrix Calculus
The Gradient
If a function $f : \mathbb{R}^{m \times n} \mapsto \mathbb{R}$ takes as input a matrix $A$ of size $m \times n$
and returns a real value, then the gradient of $f$ is

$$\nabla_A f(A) \in \mathbb{R}^{m \times n} = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\ \frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix}$$

that is,

$$\left(\nabla_A f(A)\right)_{ij} = \frac{\partial f(A)}{\partial A_{ij}}$$
• $\nabla_x (f(x) + g(x)) = \nabla_x f(x) + \nabla_x g(x)$
• For $t \in \mathbb{R}$, $\nabla_x (t f(x)) = t \nabla_x f(x)$
The Hessian
$$\nabla_x^2 f(x) \in \mathbb{R}^{n \times n} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}$$

that is,

$$\left(\nabla_x^2 f(x)\right)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$$

Note that the Hessian is symmetric, because

$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 f(x)}{\partial x_j \partial x_i}$$
Example Calculations
Example Gradient Calculation  For $x \in \mathbb{R}^n$, let $f(x) = b^T x$ for some
known vector $b \in \mathbb{R}^n$. Then

$$f(x) = \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

Thus,

$$f(x) = \sum_{i=1}^{n} b_i x_i$$

$$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^{n} b_i x_i = b_k$$

so $\nabla_x f(x) = b$.

Example Hessian Calculation  For $x \in \mathbb{R}^n$, let $f(x) = x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j$ for a symmetric matrix $A \in \mathbb{R}^{n \times n}$. Separating out the terms that involve $x_k$,

$$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k} \Big[ \sum_{i \neq k} \sum_{j \neq k} A_{ij} x_i x_j + \sum_{i \neq k} A_{ik} x_i x_k + \sum_{j \neq k} A_{kj} x_k x_j + A_{kk} x_k^2 \Big] = \sum_{i=1}^{n} A_{ik} x_i + \sum_{j=1}^{n} A_{kj} x_j = 2 \sum_{i=1}^{n} A_{ki} x_i,$$

where the last step uses the symmetry of $A$. Then

$$\frac{\partial^2 f(x)}{\partial x_k \partial x_l} = \frac{\partial}{\partial x_k} \Big[ \frac{\partial f(x)}{\partial x_l} \Big] = \frac{\partial}{\partial x_k} \Big[ \sum_{i=1}^{n} 2 A_{li} x_i \Big] = 2 A_{lk} = 2 A_{kl}$$

Thus,

$$\nabla_x^2 f(x) = 2A$$
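Both results are easy to sanity-check numerically with finite differences; a minimal sketch (the helper numeric_grad is ours):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for k in range(len(x)):
        dx = np.zeros_like(x)
        dx[k] = eps
        g[k] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

n = 4
b = np.random.rand(n)
A = np.random.rand(n, n)
A = (A + A.T) / 2                     # make A symmetric
x = np.random.rand(n)

# f(x) = b^T x  ->  gradient is b
print(np.allclose(numeric_grad(lambda v: b @ v, x), b, atol=1e-4))

# f(x) = x^T A x  ->  gradient is 2 A x, so the Hessian is 2A
print(np.allclose(numeric_grad(lambda v: v @ A @ v, x), 2 * A @ x, atol=1e-4))
```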
Pixels and Filters
Image Types
Binary Images contain pixels that are either black (0) or white (1).
Color Images have multiple color channels; each color image can
be represented in different color models (e.g., RGB, LAB, HSV). For
example, an image in the RGB model consists of red, green, and blue
channels. Each pixel in a channel has intensity values ranging from
0-255. Please note that this range depends on the choice of color
model. A 3D tensor usually represents color images (Width x Length
x 3), where the 3 channels can represent the color model such as RGB
(Red-Green-Blue), LAB (Lightness-A-B), and HSV (Hue-Saturation-
Value).
Images are samples: they are not continuous; they consist of discrete
pixels of a certain size and density. This can lead to errors (or grain-
iness) because pixel intensities can only be measured with a certain
resolution and must be approximated.
Pixels are quantized (i.e., all pixels, or channels of a pixel, take
one of a set number of values, usually in [0, 255]). Quantization and
sampling lose information due to their finite precision.
Image Histograms
Images as Functions
Most images that we deal with in computer vision are digital, which
means that they are discrete representations of the photographed
scenes. This discretization is achieved through the sampling of
2-dimensional space onto a regular grid, eventually producing a
representation of the image as a matrix of integer values.
When dealing with images, we can imagine the image matrix as
infinitely tall and wide. However, the displayed image is only a finite
subset of this infinite matrix. Having employed such a definition of
images, we can write them as coordinates in a matrix:

$$\begin{bmatrix} \ddots & \vdots & \vdots & \vdots & \\ \cdots & f[-1, 1] & f[0, 1] & f[1, 1] & \cdots \\ \cdots & f[-1, 0] & f[0, 0] & f[1, 0] & \cdots \\ \cdots & f[-1, -1] & f[0, -1] & f[1, -1] & \cdots \\ & \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$
The term filtering refers to a process that forms a new image, the pixel
values of which are transformations of the original pixel values.
Examples of Filters
Moving Average  One intuitive example of a filter is the moving
average. This filter sets the value of a pixel to be the average of its
neighboring pixels (e.g., the nine pixels in a 3 × 3 neighborhood, when
applying a 3 × 3 filter). Mathematically, we can represent this as

$$g[m, n] = \frac{1}{9} \sum_{i=-1}^{1} \sum_{j=-1}^{1} f[m - i, n - j]$$

This weighted-average filter serves to smooth out the sharper edges
of the image, creating a blurred or smoothed effect.
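A minimal sketch of this box filter in NumPy, assuming a 2-D grayscale float image and zero padding at the borders (the function name is ours):

```python
import numpy as np

def moving_average_filter(f, k=3):
    """Set each output pixel to the mean of the k x k neighborhood of
    the corresponding input pixel (zero padding at the image borders)."""
    pad = k // 2
    padded = np.pad(f, pad, mode="constant")
    g = np.zeros_like(f, dtype=float)
    for m in range(f.shape[0]):
        for n in range(f.shape[1]):
            g[m, n] = padded[m:m + k, n:n + k].mean()
    return g

img = np.zeros((5, 5))
img[2, 2] = 9.0                        # a single bright pixel
print(moving_average_filter(img))      # smeared into a 3x3 patch of ones
```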
Properties of Systems
When discussing specific systems, it is useful to describe their prop-
erties. The following includes a list of properties that a system may
possess. However, not all systems will have all (or any) of these prop-
erties. In other words, these are potential characteristics of individual
systems, not traits of systems in general.
Amplitude Properties
• Superposition: $S[\alpha f_i[n, m] + \beta f_j[n, m]] = \alpha S[f_i[n, m]] + \beta S[f_j[n, m]]$
Spatial Properties
$$f[m, n] = 0 \implies g[m, n] = 0$$

$$f[m - m_0, n - n_0] \xrightarrow{\;S\;} g[m - m_0, n - n_0]$$
Linear Systems
A linear system is a system that satisfies the property of superposition.
When we employ a linear system for filtering; we create a new image
whose pixels are weighted sums of the original pixel values, using
50 computer vision: foundations and applications
the same set of weights for each pixel. A linear shift-invariant system is
a linear system that is also shift invariant.
Linear systems also have what is known as an impulse response. To
determine the impulse response of a system S , consider first δ2 [m, n].
This is a function defined as follows:

$$\delta_2[m, n] = \begin{cases} 1 & m = 0 \text{ and } n = 0 \\ 0 & \text{otherwise} \end{cases}$$

The impulse response is $r = S[\delta_2]$.

We can then use the superposition property to write any linear shift-
invariant system as a weighted sum of such shifted impulses:

$$\alpha_1 \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} f_1[i, j]\, \delta_2[m - i, n - j] + \alpha_2 \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} f_2[i, j]\, \delta_2[m - i, n - j] + \dots$$
Convolution
The easiest way to think of convolution is as a system that uses
information from neighboring pixels to filter the target pixel. A good
example of this is a moving average, or a box filter discussed earlier.
Correlation
Cross correlation is the same as convolution, except that the filter ker-
nel is not flipped. Two-dimensional cross correlation is represented
as:
$$r[k, l] = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} f[m + k, n + l]\, g[m, n]$$
It can be used to find known features in images by using a kernel
that contains target features.
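A small sketch of this idea, using SciPy's correlate2d to locate a known patch in a toy image (the image and template values are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d

# Toy image containing a 3x3 bright "feature" at row 4, column 6
img = np.zeros((10, 12))
template = np.array([[0.0, 1.0, 0.0],
                     [1.0, 1.0, 1.0],
                     [0.0, 1.0, 0.0]])
img[3:6, 5:8] = template

# Cross-correlate with the (zero-mean) template; the peak marks the match
scores = correlate2d(img, template - template.mean(), mode="same")
print(np.unravel_index(scores.argmax(), scores.shape))   # (4, 6): template center
```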
Linear Systems
Linear Systems (Filters) form new images, the pixels of which are a
weighted sum of select groupings of the original pixels. The use of
different patterns and weights amplifies different features within the
original image. A system $S$ is a linear system if and only if it satisfies
the superposition property of systems:

$$S[\alpha f_i[n, m] + \beta f_j[n, m]] = \alpha S[f_i[n, m]] + \beta S[f_j[n, m]]$$

For a linear shift-invariant system, the output is the convolution of the
input with the system's impulse response $h$:

$$g[n, m] = \sum_{N} \sum_{M} h[N, M] \cdot f[n - N, m - M] = h[n, m] ** f[n, m]$$
Biederman
Biederman investigated the rate at which humans can recognize the
object they’re looking at. To test this, he drew outlines of common
and recognizable objects and split them into two halves, with each
56 computer vision: foundations and applications
line segment divided in only one of the halves. These outlines were
then shown to participants to test whether they could recognize the
original objects while only seeing half of the original outline.
Surprisingly, he observed no difference in terms of the speed
with which people recognized the objects. It was easy for them to
recognize an object via only parts of its edges. This study benefits
computer vision by providing an insight: even if only part of the
original image is shown, a system theoretically should still be able to
recognize the whole object or scene.
Backward:
$$\frac{df}{dx} = f(x) - f(x-1) = f'(x) \quad \rightarrow \quad [0, 1, -1]$$

Forward:
$$\frac{df}{dx} = f(x) - f(x+1) = f'(x) \quad \rightarrow \quad [-1, 1, 0]$$

Central:
$$\frac{df}{dx} = f(x+1) - f(x-1) = f'(x) \quad \rightarrow \quad [1, 0, -1]$$
Discrete Derivative in 2D
Gradient vector:
$$\nabla f(x, y) = \begin{bmatrix} f_x \\ f_y \end{bmatrix}$$

Gradient magnitude:
$$|\nabla f(x, y)| = \sqrt{f_x^2 + f_y^2}$$

Gradient direction:
$$\theta = \tan^{-1}\left( \frac{\partial f / \partial y}{\partial f / \partial x} \right)$$
Example
The gradient at a matrix index can be approximated using neighbor-
ing pixels based on the central discrete derivative equation expanded
to 2D. A filter like
$$\frac{1}{3} \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$$

when overlapped on top of a pixel $x[m, n]$, such that the center of the
filter is located at $x[m, n]$, shown with its neighbors below,

$$\begin{bmatrix} \dots & \dots & \dots & \dots & \dots \\ \dots & x_{m-1,n-1} & x_{m-1,n} & x_{m-1,n+1} & \dots \\ \dots & x_{m,n-1} & x_{m,n} & x_{m,n+1} & \dots \\ \dots & x_{m+1,n-1} & x_{m+1,n} & x_{m+1,n+1} & \dots \\ \dots & \dots & \dots & \dots & \dots \end{bmatrix}$$

produces an output of

$$\frac{1}{3} \left[ (x_{m-1,n+1} - x_{m-1,n-1}) + (x_{m,n+1} - x_{m,n-1}) + (x_{m+1,n+1} - x_{m+1,n-1}) \right],$$

which is equivalent to an approximation of the gradient in the hor-
izontal (n) direction at pixel (m, n). This filter detects the horizontal
edges, and a separate kernel is required to detect vertical ones.
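The worked example above can be reproduced directly by cross-correlating an image with this kernel; a small sketch using SciPy (the toy image is ours):

```python
import numpy as np
from scipy.signal import correlate2d

# The horizontal-derivative kernel from the example above
kernel = (1.0 / 3.0) * np.array([[-1, 0, 1],
                                 [-1, 0, 1],
                                 [-1, 0, 1]])

# A toy image with a vertical step edge: dark on the left, bright on the right
img = np.zeros((5, 6))
img[:, 3:] = 1.0

# Cross-correlation slides the kernel without flipping it, matching the
# "overlap the filter on x[m, n]" description above
response = correlate2d(img, kernel, mode="same", boundary="symm")
print(np.round(response, 2))   # strong response only along the vertical edge
```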
Characterizing Edges
Characterizing edges (i.e., characterizing them properly so they can
be recognized) is an important first step in detecting edges. For our
purposes, we will define an edge as a place of rapid change in the
image intensity function. If we plot the intensity function along a
horizontal scanline, we see that the edges correspond to the extrema of
the derivative. Therefore, noticing sharp changes along this plot will
likely give us edges.
Image Gradient
The gradient of an image has been defined as the following:
$$\nabla f = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right],$$

while its direction has been defined as:

$$\theta = \tan^{-1}\left( \frac{\partial f}{\partial y} \Big/ \frac{\partial f}{\partial x} \right).$$
The gradient vectors point toward the direction of the most rapid
increase in intensity. For a vertical edge, for example, the most rapid
change of intensity occurs in the x-direction.
Effects of Noise
If there is excessive noise in an image, the partial derivatives will not
be effective for identifying the edges.
In order to account for the noise, the images must first be smoothed.
This is a process in which pixel values are recalculated so that they
more closely resemble their neighbors. The smoothing is achieved by
means of convolving the image with a filter (e.g., gaussian kernel).
There are, of course, some concerns to keep in mind when smooth-
ing an image. The image smoothing does remove noise, but it also
blurs the edges; the use of large filters can result in the loss of edges
and the finer details of the image.
Gaussian Blur
1. Good detection
2. Good localization
3. Single response
Edge Detection
Fig. 1. Certain areas of the brain react to edges; the line drawings
are as recognizable as the original image (Dirk B. Walther, Barry Chai,
Eamon Caddigan, Diane M. Beck, and Li Fei-Fei. Simple line drawings
suffice for functional MRI decoding of natural scene categories.
Proceedings of the National Academy of Sciences, 108(23):9661–9666, 2011).

Edge Basics

There are four possible sources of edges in an image: surface normal
discontinuity, depth discontinuity, surface color discontinuity, and
illumination discontinuity.
Discrete Derivatives
$$\frac{d f(x)}{dx} = \lim_{\delta x \to 0} \frac{f(x) - f(x - \delta x)}{\delta x} = f'(x)$$

In the discrete case, the smallest possible step is one pixel ($\delta x = 1$):

$$\frac{d f(x)}{dx} = \frac{f(x) - f(x - 1)}{1} = f(x) - f(x - 1) = f'(x)$$

It is also possible to take the derivative three different ways:

• Backward: $f'(x) = f(x) - f(x-1)$
• Forward: $f'(x) = f(x+1) - f(x)$
• Central: $f'(x) = \frac{f(x+1) - f(x-1)}{2}$

Each corresponds to a small filter; for example, the forward difference written as $f'(x) = f(x) - f(x+1)$ corresponds to the kernel $[-1, 1, 0]$.
We can also calculate the magnitude and the angle of the gradient:
$$|\nabla f(x, y)| = \sqrt{f_x^2 + f_y^2}, \qquad \theta = \tan^{-1}(f_y / f_x)$$
Reducing noise
Noise will interfere with the gradient, making it impossible to find
edges using the simple method, even though the edges are still
detectable to the eye. The solution is to first smooth the image.
$$\frac{d}{dx}(f * g)$$

By the derivative theorem of convolution:

$$\frac{d}{dx}(f * g) = f * \frac{d}{dx} g$$
This simplification saves us one operation. Smoothing removes noise
but blurs edges. Smoothing with different kernel sizes can detect
edges at different scales
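As a small illustration of the theorem, the sketch below compares "smooth, then differentiate" against "convolve once with the derivative of the Gaussian" on a 1-D signal (kernel sizes and σ are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

f = np.zeros(100)
f[50:] = 1.0                      # a step edge
f += 0.05 * np.random.randn(100)  # noise

# Option 1: smooth with a Gaussian, then take a finite difference
smoothed = gaussian_filter1d(f, sigma=3)
d1 = np.diff(smoothed)

# Option 2: convolve once with the derivative of the Gaussian (order=1)
d2 = gaussian_filter1d(f, sigma=3, order=1)

print(np.argmax(np.abs(d1)), np.argmax(np.abs(d2)))   # both peak near index 50
```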
The Sobel Filter has many problems, including poor localization. The
Sobel Filter also favors horizontal and vertical edges over oblique
edges
• Suppress noise
• Hysteresis thresholding
Suppress noise We can both suppress noise and compute the deriva-
tives in the x and y directions using a method similar to the Sobel
filter.
$$\theta = \tan^{-1}(f_y / f_x)$$
Hough Transforms
Say we have a point $(x_i, y_i)$. There are infinitely many lines that could pass through
this point. We can define a line that passes through this pixel $(x_i, y_i)$ as

$$y_i = a x_i + b$$

Rearranging, every line through $(x_i, y_i)$ corresponds to a point on the line

$$b = -a x_i + y_i$$

in the $(a, b)$ parameter space.
Accumulator Cells
In order to get the "best" lines, we quantize the a, b space into cells.
For each line in our a, b space, we add a "vote" or a count to each cell
that it passes through. We do this for each line, so at the end, the
cells with the most "votes" have the most intersections and therefore
should correspond to real lines in our image.
The algorithm for Hough transform in a, b space is as follows:
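A minimal sketch of this voting procedure, assuming points are given as (x, y) pairs; the quantization ranges and bin counts below are illustrative choices, not values from the notes:

```python
import numpy as np

def hough_lines_ab(points, a_range=(-5, 5), b_range=(-50, 50), bins=200):
    """Vote in a quantized (a, b) accumulator: each point (x_i, y_i) adds one
    vote to every cell its parameter-space line b = -a*x_i + y_i passes through."""
    a_vals = np.linspace(a_range[0], a_range[1], bins)
    b_edges = np.linspace(b_range[0], b_range[1], bins + 1)
    acc = np.zeros((bins, bins), dtype=int)            # rows index a, columns index b
    for x, y in points:
        b = -a_vals * x + y                            # b for every candidate slope a
        b_idx = np.digitize(b, b_edges) - 1
        valid = (b_idx >= 0) & (b_idx < bins)
        acc[np.arange(bins)[valid], b_idx[valid]] += 1
    i, j = np.unravel_index(acc.argmax(), acc.shape)   # cell with the most votes
    return a_vals[i], (b_edges[j] + b_edges[j + 1]) / 2

# Points on y = 2x + 1 all vote for (roughly) the same accumulator cell
pts = [(x, 2 * x + 1) for x in range(10)]
print(hough_lines_ab(pts))    # close to (2, 1), up to the cell quantization
```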
Concluding Remarks
The advantages of the Hough transform are that it is conceptually simple
(just transforming and finding intersections in Hough space), it is
fairly easy to implement, and it can handle missing and occluded
data well. Another advantage is that it can find structures other
than lines, as long as the structure has a parametric equation.

Some disadvantages are that it becomes more computationally
complex the more parameters you have, and it can only look for one
kind of structure at a time (so not lines and circles together). The length
and the position of a line segment also cannot be detected by it. It can
be fooled by "apparent" lines, and co-linear line segments cannot be
separated.
RANSAC
Introduction
Suppose a model (e.g., a line with two parameters) is to be fitted to the
data; while the majority of data points fit a linear model, the two points
in the top right corner can significantly affect the accuracy of the
overall fit (if they are included in the fit). The
RANSAC algorithm aims to address this challenge by identifying the
"inliers" and "outliers" in the data.
RANSAC randomly selects samples of the data, with the assump-
tion that if enough samples are chosen, there will be a low probabil-
ity that all of the samples provide a bad fit.
The Algorithm
The RANSAC algorithm iteratively samples minimal subsets of the
original data (e.g., 2 points for line estimation); the model is fitted
to each sample, and the number of "inliers" corresponding to this
fit is calculated; this includes the data points that are close to the
fitted model. The points closer than a threshold (e.g., 2 standard
deviations, or a pre-determined number of pixels) are considered
"inliers". The fitted model is considered good if a big fraction of the
data is considered as "inliers" for that fit. In the case of a good fit, the
model is re-fitted using all the inliers, and the outliers are discarded.
This process is repeated, and model estimates with a big enough
fraction of inliers (e.g., bigger than a pre-specified threshold) are
compared to choose the best-performing fit. Fig. 8 illustrates this
process for a linear model and its three samples. The third sample
(Fig. 8c) provides the best fit as it includes the most number of
features and fitting 73
inliers.
5. repeat steps 1-4 and finally keep the estimate with most inliers
and best fit.
$$P_{fail} = (1 - W^n)^k = 1 - p$$

where $W$ and $n$ are respectively the fraction of inliers and the num-
ber of points needed for model fitting, $k$ is the number of samples
drawn, and $p$ is the desired probability of success. The minimum
number of samples is:

$$k = \frac{\log(1 - p)}{\log(1 - W^n)}$$
Fig. 10. The number of samples for different choices of noise popu-
lation and model size; source: David Lowe)
RANSAC
Goal
RANdom SAmple Consensus (RANSAC) is used for model fitting
in images (e.g., line detection); it can be extremely useful for object
identification, among other applications. It is often more effective
than pure edge detection which is prone to several limitations: edges
containing extra points due to the noise/clutter, certain parts of the
edges being left out, and the existence of noise in measured edges’
orientation.
Motivation
One of the primary advantages of RANSAC is that it is relatively
efficient and accurate even when the number of parameters is high.
However, it should be noted that RANSAC is likely to fail or produce
inaccurate results in images with a relatively large amount of noise.
General Approach
The intuition for RANSAC is that by randomly sampling a group of
points in an edge and applying a line of best fit to those points many
times, we have a high probability of finding a line that fits the points
very well. Below is the general process "RANSAC loop":
1. Randomly select a seed group of points from the data.

2. Compute a line of best fit among the seed group. For example, if
the seed group is only 2 distinct points, then it is clear to see that
there is only one line that passes through both points, which can
be determined with relative ease from the points’ coordinates.
3. Find the number of inliers to this line by iterating over each point
in the data set and calculating its distance from the line; if it is less
than a (predetermined) threshold value, it counts as an inlier.
Otherwise, it is counted as an outlier.
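Putting the loop together, here is a minimal RANSAC sketch for 2-D line fitting (function name, thresholds, and iteration counts are illustrative, not prescribed by the notes):

```python
import numpy as np

def ransac_line(points, n_iters=100, threshold=0.5, seed=0):
    """Fit y = a*x + b with RANSAC: repeatedly fit a line to a random
    2-point seed group, count inliers within `threshold`, and keep the
    model with the most inliers; finally re-fit using all of its inliers."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    best_inliers = None
    for _ in range(n_iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        (x1, y1), (x2, y2) = pts[i], pts[j]
        if np.isclose(x1, x2):
            continue                                    # skip degenerate vertical seeds
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        dist = np.abs(pts[:, 1] - (a * pts[:, 0] + b))  # vertical residuals to the line
        inliers = pts[dist < threshold]
        if best_inliers is None or len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Final estimate: least-squares fit over the largest inlier set
    a, b = np.polyfit(best_inliers[:, 0], best_inliers[:, 1], deg=1)
    return a, b

data = [(x, 3 * x + 2) for x in range(20)] + [(5, 80), (12, -40)]   # two gross outliers
print(ransac_line(data))    # close to (3.0, 2.0) despite the outliers
```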
Drawbacks
The biggest drawback to RANSAC is its inefficient handling of highly
noisy data; an increase in the fraction of outliers in a given data set
results in an increase in the number of samples required for model
fitting (e.g., line of best fit). More importantly, the noisier an image is,
the less likely it is for a line to ever be considered sufficiently good at
fitting the data. This is a significant problem because most real world
problems have a relatively large proportion of noise/outliers.
Motivation
The local invariant image features and their descriptors are used in
a wide range of computer vision applications; they include, but are
not limited to, object detection, classification, tracking, motion esti-
mation, panorama stitching, and image registration. The previously
discussed methods, such as cross-correlation, are not effective and
robust in many such applications. This method works by finding local,
distinctive structures within an image (i.e., features), and it describes
each feature using the surrounding region (e.g., a small patch cen-
tered on the detected feature). This "local" representation of image
features (as opposed to a "global" one, such as cross-correlation)
provides a more robust means of addressing the above-mentioned
computer vision problems; such a strategy is invariant to object rota-
tions, point-of-view translations, and scale changes.
General Approach
The general approach for employing local invariant features is de-
tailed below:
Requirements
Good local features should have the following properties:
Keypoint Localization
Motivation
The goal of keypoint localization is to detect features consistently
and repeatedly, to allow for more precise localization, and to find
interesting content within the image.
General Approach
We will look for corners since they are repeatable and distinctive in
the majority of images. To find corners, we look for large changes in
intensity in all directions. To provide context, a "flat" region would
show no change in any direction, and an edge would have no change
along the direction of the edge. We will find these corners using the
Harris technique.
Harris Detector
The intuition behind the Harris detector is that if a window (w) slides
over an image, the change in the intensity of pixel values caused
by the shift is highest at corners. This is because change in pixel
intensity is observed in both directions (x and y) at corners, while it
is limited to only one direction at the edges, and it is negligible at flat
image regions. To calculate the change of intensity due to the shift
[u, v]:
$$E(u, v) = \sum_{x, y} w(x, y) \left[ I(x + u, y + v) - I(x, y) \right]^2$$
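For concreteness, here is a sketch of the standard Harris pipeline built from this idea: image gradients, a Gaussian-windowed second-moment matrix M, and the response R = det(M) − k·trace(M)² (the constant k and the σ below are typical illustrative values, not prescribed by the notes):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(img, sigma=1.0, k=0.05):
    """Harris corner response at every pixel."""
    Ix = sobel(img, axis=1, mode="reflect")            # horizontal gradient
    Iy = sobel(img, axis=0, mode="reflect")            # vertical gradient
    # Gaussian-windowed products of gradients: the entries of M at each pixel
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det_M = Sxx * Syy - Sxy ** 2
    trace_M = Sxx + Syy
    return det_M - k * trace_M ** 2

# A white square on black: the response peaks near its four corners
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
R = harris_response(img)
print(np.unravel_index(R.argmax(), R.shape))           # a location near a corner
```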
Motivation
Solution
General Approach
We can find the local maximum of a function. Relative to the local
maximum, the region size should be the same regardless of the scale.
This also means that the region size is co-variant with the image scale.
A "good" function results in a single and distinct local maximum.
In general, we should use a function that responds well to stark
contrasts in intensity.
$$L = \sigma^2 \left( G_{xx}(x, y, \sigma) + G_{yy}(x, y, \sigma) \right)$$

$$DoG = G(x, y, k\sigma) - G(x, y, \sigma)$$

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$$
Both these kernels are scale and rotation invariant.
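A small sketch of building a DoG stack by differencing successively blurred images (the blob image and the choices of k, σ0, and number of levels are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(img, sigma0=1.0, k=np.sqrt(2), levels=4):
    """Each level is G(x, y, k*sigma) - G(x, y, sigma), approximating the
    scale-normalized Laplacian used for blob and keypoint detection."""
    blurred = [gaussian_filter(img.astype(float), sigma0 * k ** i)
               for i in range(levels + 1)]
    return [blurred[i + 1] - blurred[i] for i in range(levels)]

img = np.zeros((64, 64))
img[28:36, 28:36] = 1.0                # a small bright blob
dogs = dog_stack(img)
# The level with the strongest absolute response indicates the blob's scale
print([round(float(np.abs(d).max()), 3) for d in dogs])
```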
Motivation
Thus far, we have covered the detection of keypoints in single images,
but broader applications require such detections across similar
images at vastly different scales. For example, we might want to
search for pedestrians from the video feed of an autonomous vehicle
without the prior knowledge of the pedestrians’ sizes. Similarly, we
might want to stitch a panorama using photos taken at different
scales. In both cases, we need to independently detect the same
keypoints at those different scales.
General methodology
Currently, we use windows (e.g., in Harris Corner Detection) to
detect keypoints. Using identically sized windows will not enable
the detection of the same keypoints across different-sized images
Scale invariance
2. Very slow
3. limit artifacts.
However, what is considered “important” is very subjective, for what
may be important to one observer may not be important to another.
Pixel energy
A way to decide what is considered “important” is using saliency
measures. There are many different types of saliency measures,
but the concept is the same: each pixel p has a certain amount of
“energy” that can be represented by the function E( p).
The concept is that pixels with higher energy values are more
salient, or more important, than pixels with lower energy values.
What actually goes into the heart of E is up to the beholder.
A good example is to use the gradient magnitude of pixel p to
heavily influence E( p), for this usually indicates an edge. Since
humans are particularly receptive to edges, this is a part of the image
that is potentially valuable and interesting, compared to something
that has a low gradient magnitude. As a result, this preserves strong
contours and is overall simple enough to produce nice results. This
example of $E$ for image $I$ could be represented as

$$E(I) = \left| \frac{\partial I}{\partial x} \right| + \left| \frac{\partial I}{\partial y} \right|.$$
Seam Carving
Basic Idea
Human vision is more sensitive to edges in an image. Thus,
a simple but effective solution is to remove contents from smoother
areas and preserve the more informative image regions that con-
tain edges; this is achieved using a gradient-based energy function,
defined as:
$$E(I) = \left| \frac{\partial I}{\partial x} \right| + \left| \frac{\partial I}{\partial y} \right|$$
Unimportant contents are, therefore, pixels with smaller values of
energy function.
Pixel Removal
There exist different approaches for removing the unimportant pixels,
and each can lead to different visual results. The figure below demon-
strates three examples of such approaches; the first two (i.e., optimal
least-energy pixel and row removal) are observed to negatively affect
the image quality. The last one (i.e., least-energy column removal), on
the other hand, works significantly better, but it still causes plenty of
artifacts in the new image. An alternative solution is presented in the
next section.
A Seam
1. A seam is defined as a connected path of pixels from top to bot-
tom (or left to right). For a top-to-bottom seam, we pick exactly
one pixel from each row. The mathematical definition is
2. The optimal seam is the seam which minimizes the energy func-
tion, based on pixel gradients.
$$s^* = \arg\min_s E(s), \quad \text{where } E(I) = \left| \frac{\partial I}{\partial x} \right| + \left| \frac{\partial I}{\partial y} \right|$$
The average energy of the image increases as the seam carv-
ing algorithm removes low-energy pixels. The described seam
carving algorithm can be used to modify aspect ratio, to achieve
object removal, and to perform image resizing. The process is
the same if an image is flipped. When resizing, both horizontal
and vertical seams need to be removed. One can solve for the order
of adding and removing seams in both directions by dynamic
programming. Specifically, the recurrence relation is:

$$T(r, c) = \min\left( T(r-1, c) + E\big(s^x(I_{(n-r-1) \times (m-c)})\big), \; T(r, c-1) + E\big(s^y(I_{(n-r) \times (m-c-1)})\big) \right)$$

For more information, refer to the SIGGRAPH paper on seam carving
(Shai Avidan and Ariel Shamir. Seam carving for content-aware image
resizing. ACM Trans. Graph., 26(3):10, 2007).
Algorithm 1: Seam-Carving
1: im ← original image of size m × n
2: n0 ← desired image size n’
3:
4: Do (n-n’) times:
5: E ← Compute energy map on im
6: s ← Find optimal seam in E
7: im ← Remove s from im
8:
9: return im
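A runnable sketch of the dynamic program behind step 6 ("find optimal seam"), assuming the energy map is already computed; the cumulative-cost matrix M and the backtracking are the standard formulation, and the function names are ours:

```python
import numpy as np

def find_vertical_seam(energy):
    """M[i, j] holds the minimum cumulative energy of any top-to-bottom
    seam ending at pixel (i, j); backtrack from the cheapest bottom pixel."""
    h, w = energy.shape
    M = energy.astype(float).copy()
    back = np.zeros((h, w), dtype=int)
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 1, w - 1)
            k = lo + int(np.argmin(M[i - 1, lo:hi + 1]))   # best of the 3 parents
            back[i, j] = k
            M[i, j] += M[i - 1, k]
    seam = [int(np.argmin(M[-1]))]
    for i in range(h - 1, 0, -1):
        seam.append(int(back[i, seam[-1]]))
    return seam[::-1]                    # seam column index for each row

def remove_vertical_seam(img, seam):
    return np.array([np.delete(row, col) for row, col in zip(img, seam)])

energy = np.random.rand(6, 8)
seam = find_vertical_seam(energy)
print(seam, remove_vertical_seam(energy, seam).shape)     # seam columns, (6, 7)
```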
Image Expansion
A similar approach can be employed to increase the size of images.
By expanding the least important areas of the image (as indicated
by our seams), the image dimensions can be increased without
impacting the main content. A naive approach is to iteratively find
and duplicate the lowest energy seam. However, this provides us
results as depicted below:
image resizing 95
On the right side of the image (Figure 9), one seam has been dupli-
cated repeatedly. This is because the program retrieves the same (least-energy)
seam repeatedly. A more effective implementation is to find the first
k seams at once and duplicate each:
Figure 10: A more effective image expansion strategy using the first k
low-energy seams. Source: https://commons.wikimedia.org/wiki/file:broadway_tow
Object Removal
By allowing users to specify which areas of the image to give high
or low energy, we can use seam carving to specifically preserve or
remove certain objects. The algorithm chooses seams specifically so
that they pass through the given object (in green below).
Figure 12: The seam carving algorithm can be used for object
removal by assigning low energy values to part of the image
(Shai Avidan and Ariel Shamir. Seam carving for content-aware
image resizing. ACM Trans. Graph., 26(3):10, 2007).

Limitations
While the flat and smooth image regions (i.e., with low gradients)
are important to the image, they are removed; for example, this
includes the woman’s cheeks and forehead. While these regions are
low in energy, they are important features to human perception and
should be preserved. To address such limitations, the energy function
can be modified to consider additional information. For example, a
face detector can be used to identify important contents (i.e., human
faces) or other constraints can be applied by the users.
A better criterion is to remove the seam whose removal will
insert the least energy into the image. This approach is known as the
forward energy; our original accumulated cost matrix is modified
by adding the forward energy from corresponding new neighbors,
as seen in the image below. The originally introduced method is
referred to as "backward" energy.
Introduction
The devices used for displaying images and videos are of different
sizes and shapes. The optimal image/video viewing configuration,
therefore, varies across devices and screen sizes. For this reason, im-
age resizing is of great importance in computer vision. The intuitive
idea is to rescale or crop the original image to fit the new device, but
that often causes artifacts or even loss of important content in images.
This lecture discussed the techniques used for resizing images while
preserving important content and limiting the artifacts.
Problem Statement
Input: an image of size n × m. Return an image of desired size
n′ × m′ which will be a good representative of the original image. The
expectations are:
1. The new image should adhere to device geometric constraints.
2. The new image should preserve the important content and struc-
tures.
Importance Measures
1. A function, S : p → [0, 1], is used to determine which parts in
an image are important; then, different operators can be used
to resize the image. One solution is to use an optimal cropping
window to extract the most important contents, but this may result
in the loss of important contents.
Segmentation
In the image below, you might see zebras, or you might see a lion.
Figure 21: Common fate provides visual cues for the segmentation
problem; source: Arthus-Bertrand (via F. Durand)
We can also illustrate common fate with this optical illusion. This
illusion, called the Müller-Lyer illusion, tricks us into thinking the
bottom line segment is longer than the top line segment, even though
they are actually the same length (disregarding the four mini-tails).
Figure 23: Proximity can aid the image segmentation; source: Kristen
Grauman
Clustering
We can now more clearly see that the image is a few 9’s occluded
by the gray lines. This is an example of a continuity through occlu-
sion cue. The gray lines give us a cue that the black pixels are not
separate and should in fact be grouped together. By grouping the
black pixels together and not perceiving them as separate objects, our
brain is able to recognize that this picture contains a few digits.
Below is another example:
What do you see? Do you see a cup or do you see the side of 2
heads facing each other? Either option is correct, depending on your
perspective. This variation in perception is due to what we identify as
the foreground and the background. If we identify the black pixels as
Agglomerative Clustering
Distance Measures
We measure the similarity between objects by determining the dis-
tance between them: the smaller the distance, the higher the degree
of similarity. There are several possible distance functions, but it is
hard to determine what makes a good distance metric, so usually the
focus is placed on standard, well-researched distance metrics such as
the two detailed below.
Dot product similarity:
$$sim(x, x') = x^T x' \tag{1}$$

Cosine similarity:
$$sim(x, x') = \cos(\theta) \tag{2}$$
$$= \frac{x^T x'}{\|x\| \cdot \|x'\|} \tag{3}$$
$$= \frac{x^T x'}{\sqrt{x^T x}\,\sqrt{x'^T x'}} \tag{4}$$
Now that the potential distance metrics are defined, the next step
is to choose a clustering technique. There are various properties of
clustering methods that we might want to consider when choosing
specific techniques:
1. Single link: the distance between clusters is the distance between their closest members:
$$d(C_i, C_j) = \min_{x \in C_i,\, x' \in C_j} d(x, x')$$

2. Complete link: the distance between clusters is the distance between their farthest members:
$$d(C_i, C_j) = \max_{x \in C_i,\, x' \in C_j} d(x, x')$$

3. Average link: the distance between clusters is the average distance over all pairs:
$$d(C_i, C_j) = \frac{\sum_{x \in C_i,\, x' \in C_j} d(x, x')}{|C_i| \cdot |C_j|} \tag{7}$$
K-Means Clustering
At the top left of figure 11, we have an image with three distinct
color regions, so segmenting the image using color intensity can be
achieved by assigning each color intensity, shown on the top right,
to a different cluster. In the bottom left image, however, the image is
cluttered with noise. To segment the image, we can use k-means.
$$SSD = \sum_{i \in \text{clusters}} \; \sum_{x \in \text{cluster}_i} (x - c_i)^2 \tag{8}$$
Algorithm
1. Initialize the cluster centers $c_1, \dots, c_k$ (e.g., randomly).

2. Assign each point to the nearest cluster center.

3. Recompute each cluster center as the mean of the points assigned to it.

4. Repeat Steps 2-3 until the value of the cluster centers stops chang-
ing or the algorithm has reached the maximum number of itera-
tions.
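A compact NumPy sketch of these steps (random initialization; the toy 1-D data stand in for pixel intensities):

```python
import numpy as np

def kmeans(X, k, n_iters=50, seed=0):
    """Alternate between assigning points to the nearest center and
    recomputing each center as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # step 1: init
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                # step 2: assign
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])  # step 3: update
        if np.allclose(new_centers, centers):                        # step 4: converged
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs of "intensities"
X = np.concatenate([np.random.randn(50, 1), np.random.randn(50, 1) + 10])
labels, centers = kmeans(X, k=2)
print(np.sort(centers.ravel()))        # roughly [0, 10]
```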
Output
Segmentation as Clustering
K-Means++
K-means method is appealing due to its speed and simplicity but
not its accuracy. By augmentation with a variant on choosing the
initial seeds for the k-means clustering problems, arbitrarily bad
clusterings that are sometimes a result of k-means clustering may be
avoided. The algorithm for choosing the initial seeds for k-means++
is outlined as follows:

1. Choose the first cluster center uniformly at random from the data points.

2. Choose each subsequent center from the remaining data points with probability proportional to its squared distance from the closest existing center.

3. Repeat the previous step until k centers have been chosen, and
then proceed with the usual k-means clustering process, as the
initial seeds have been selected.
Evaluation of clusters
The clustering results can be evaluated in various ways. For example,
there is an internal evaluation measure, which involves giving a
single quality score to the results. External evaluation, on the other
hand, compares the clustering results to an existing true classification.
More qualitatively, we can evaluate the results of clustering based on
a generative measure: how well the points can be reconstructed from
the clusters, i.e., whether the center of a cluster is a good representation
of its data. Another evaluation method is a discriminative measure,
where we evaluate how well the clusters correspond to the labels: we
check if the clusters are able to separate things that should be separated.
This measure can only be used in a supervised setting, as there are no
labels associated with unsupervised learning.
Cons
Mean-shift Clustering
In this picture, we see that the algorithm will generate 2 clusters. All the data
points on the left converge onto one center and all the data points on
the right converge onto a different center.
Mean-Shift
Optimizations
To correctly assign points to clusters, a window must be initialized
at each point and shifted to the most dense area. This procedure can
result in a large number of redundant or very similar computations.
It is possible to improve the speed of the algorithm by computing
window shifts in parallel or by reducing the number of windows that
must be shifted over the data; this is achieved using a method called
"basin of attraction".
same cluster. This is because the window has just been shifted to an
area of higher density, thus it is likely that points close to this area all
belong to the same cluster
the computation cost will be. However, the resulting cluster assign-
ments will have less error if mean-shift was calculated without this
method. The larger the values, the more nearby points are assigned
resulting in faster speed increases, but also the possibility that the
final cluster assignments will be less accurate to standard mean-shift.
Technical Details
$$\hat{f}_K(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right) \tag{9}$$
Mean-Shift Procedure
From a given point xt , calculate the following steps in order to reach
the center of the cluster.
$$\nabla f(x_i) = 0 \tag{14}$$
Kernel Functions
A kernel, K ( x ) is a non-negative function that integrates to 1 over all
values of x. These requirements ensure that kernel density estimation
will result in a probability density function.
Examples of popular kernel functions include:
• Uniform (rectangular):
$$K(x) = \tfrac{1}{2} \text{ for } |x| \leq 1, \text{ and } 0 \text{ otherwise}$$

• Gaussian:
$$K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}$$

• Epanechnikov (parabolic):
$$K(x) = \tfrac{3}{4}(1 - x^2) \text{ for } |x| \leq 1, \text{ and } 0 \text{ otherwise}$$
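As an illustration of the procedure, here is a sketch of shifting a single point to its mode using a flat (uniform) kernel window; the bandwidth and the synthetic data are illustrative:

```python
import numpy as np

def mean_shift_point(x, data, bandwidth=2.0, n_iters=50):
    """Move x uphill on the density estimate: repeatedly replace it with
    the mean of all data points inside the current window."""
    for _ in range(n_iters):
        in_window = data[np.linalg.norm(data - x, axis=1) < bandwidth]
        new_x = in_window.mean(axis=0)
        if np.linalg.norm(new_x - x) < 1e-5:
            break
        x = new_x
    return x

# Two 2-D clusters; a point from each converges to its own mode
data = np.concatenate([np.random.randn(50, 2), np.random.randn(50, 2) + 8])
modes = np.array([mean_shift_point(p, data) for p in data[[0, -1]]])
print(np.round(modes, 1))     # one mode near (0, 0), the other near (8, 8)
```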
Mean-Shift Conclusions
The mean-shift algorithm has many pros and cons to consider.
Positives of mean-shift:
Negatives of mean-shift:
Object recognition
Challenges
The challenges to building good object recognition methods include
both image and category complications. Since computers can only
see the pixel values of images, object recognition methods must
account for a lot of variance.
Deformation Objects can change form and look very different, while
still being considered the same object. For example, a person can be
photographed in a number of poses, but is still considered a person if
they’re bending over or if their arms are crossed.
K-nearest neighbors
Supervised learning
We can use machine learning to learn how to label an image based
upon its features. The goal of supervised learning is to use an exist-
ing data set to find the following equation:
y = f (x) (15)
Figure 3: Decision regions and decision boundaries (in red) for three
categories, R1, R2, and R3, over a feature space with two dimensions.
Heuristics are used to break ties, and are evaluated based upon what
works the best.
Figure 4: For the + data point, the green circle represents its
k nearest neighbors, for k = 5. Since three out of five of its nearest
neighbors are green circles, our test data point will be classified as a
green circle.
If the value of K is too high, then the neighborhood may include points
from other classes. Similarly, as K increases, the decision boundaries
will become smoother, while for a smaller value of K, there will be
more small regions to consider.
Solution: Normalize!
Normalizing the vectors to unit length will guarantee that the
proper Euclidean measurement is used.
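A minimal sketch of the k-nearest-neighbor rule with Euclidean distance and majority voting (the toy training set is ours):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = np.bincount(y_train[nearest])
    return votes.argmax()                 # ties broken by the lowest label

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.5, 0.5]), X_train, y_train, k=3))   # 0
print(knn_classify(np.array([5.2, 5.1]), X_train, y_train, k=3))   # 1
```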
Bias-variance trade-off
The key to generalization is to achieve low generalization error by
finding the right number/type of parameters. There are two main
components of generalization error: bias and variance. Bias is defined
as how much the average model over all the training sets differs
from the true model. Variance is defined as how much the models
estimated from different training sets differ from each other. We need
to find the right balance between bias and variance, hence, the bias-
variance trade-off. Models with too few parameters are inaccurate
because of a large bias (not enough flexibility). Similarly, models with
too many parameters are inaccurate because of a large variance (too
much sensitivity to the sample). Two types of incorrect fitting will be
listed below:
Overview
Dimensionality reduction is a process for reducing the number of
features used in an analysis or a prediction model. This enhances
the performance of computer vision and machine learning-based
approaches and enables us to represent the data in a more efficient
way. There are several methods commonly used in dimensionality
reduction. The two main methods covered in this lecture are Singular
Value Decomposition (SVD) and Principal Component Analysis
(PCA).
Motivation
Dimension reduction benefits models for a number of reasons.
In such cases the computational cost per data point may be re-
duced by many orders of magnitude with a procedure like SVD
Overview
Intuitively, Singular Value Decomposition (SVD) is a procedure that
allows one to represent the data in a new sub-feature space, such
that the majority of variations in the data is captured; this is achieved
by "rotating the axes" of the original feature space to form new axes
which are linear combinations of the original axes/features (e.g. age,
income, gender, etc. of a customer). These "new axes" are
useful because they systematically break down the variance in the
data points (how widely the data points are distributed) based on
each direction's contribution to the variance in the data:
The result of this process is a ranked list of "directions" in the
feature space ordered from most variance to least. The directions
along which there is greatest variance are referred to as the "princi-
pal components" (of variation in the data); by focusing on the data
distribution along these dimensions, one can capture most of the in-
formation represented in the original feature space without having to
deal with a high number of dimensions in the original space (but see
below on the difference between feature selection and dimensionality
reduction).
Continuing with the SVD example we have above, notice that Col-
umn 1 of U gets scaled by the first value from Σ.
6. Step (5) implies we simply need the columns of US, both of which are
matrices surfaced by SVD.
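A short NumPy sketch of this reduction: center the data, take the SVD, and keep the first k columns of US as the reduced coordinates (the synthetic data are illustrative):

```python
import numpy as np

def svd_reduce(X, k):
    """Rows of X are data points; return their coordinates along the
    top-k singular directions (the first k columns of U*S)."""
    Xc = X - X.mean(axis=0)                            # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]                            # n x k reduced representation

# 100 points that mostly vary along a single direction in 3-D
t = np.random.randn(100, 1)
X = np.hstack([t, 2 * t, 0.5 * t]) + 0.01 * np.random.randn(100, 3)
Z = svd_reduce(X, k=1)
print(Z.shape)      # (100, 1): one coordinate captures nearly all the variance
```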
• Web search engines also utilize PCA. There are billions of pages on
the Internet that may have a non-trivial relation to a provided
search phrase. Companies such as Google, Bing, and Yahoo
typically narrow the search space by only considering a small
subset of this search matrix, which can be extracted using PCA
(Abdulla H.D. and Snasel V. Search result clustering using a singular
value decomposition (SVD). 2009). This is critical for timely and
efficient searches, and it speaks to the power of SVD.
Neuroscience Background
In the 1960’s and 1970’s, neuroscientists discovered that depending
on the angle of observation, certain brain neurons fire when look-
ing at a face. More recently, they have come to believe that an area
of the brain known as the Fusiform Face Area (FFA) is primarily
responsible for reacting to faces. These advances in the biological
understanding of facial recognition have been mirrored by similar
advances in computer vision, as new techniques have attempted to
come closer to the standard of human facial recognition.
Applications
Computer facial recognition has a wide range of applications:
• Digital Photography: Identifying specific faces in an image allows
programs to respond uniquely to different individuals, such as
centering the image focus on a particular individual or improving
aesthetics through various image operations (blur, saturation, etc).
Space of Faces
If we consider an m × n image of a face, that image can be represented
by a point in high dimensional space (Rmn ). But relatively few high-
dimensional vectors consist of valid face images (images can contain
much more than just faces), and thus the region that an arbitrary face
image could fall into is a relatively small subspace. The task is to
effectively model this subspace of face images.
Training Algorithm
2: Compute the mean of the training images:
$$\mu = \frac{1}{N} \sum_i x_i$$

3: Compute the difference images (the centered data matrix):
$$X_c = X - \mu \mathbf{1}^T = X - \frac{1}{N} X \mathbf{1} \mathbf{1}^T = X \left( I - \frac{1}{N} \mathbf{1} \mathbf{1}^T \right)$$

4: Compute the covariance matrix:
$$\Sigma = \frac{1}{N} X_c X_c^T$$

5: Compute the eigenvectors of the covariance matrix $\Sigma$ using PCA
(Principal Component Analysis).

6: Compute each training image $x_i$'s projections as
$$x_i \rightarrow (x_i^c \cdot \phi_1, \, x_i^c \cdot \phi_2, \, \dots, \, x_i^c \cdot \phi_k) \equiv (a_1, a_2, \dots, a_k)$$
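A compact sketch of these training steps, with one flattened face per column of X; computing the eigenvectors via the SVD of the centered data matrix is a standard, numerically friendlier equivalent of eigendecomposing the covariance (the random data below merely stand in for face images):

```python
import numpy as np

def eigenfaces(X, k):
    """Return the mean face, the top-k eigenfaces, and each training
    image's k projection coefficients."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                        # centered data matrix
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # columns of U span the face space
    Phi = U[:, :k]                                     # eigenfaces phi_1, ..., phi_k
    A = Phi.T @ Xc                                     # k x N projection coefficients
    return mu, Phi, A

X = np.random.rand(64, 20)        # 20 "images" of 8x8 pixels, flattened to 64-vectors
mu, Phi, A = eigenfaces(X, k=5)
print(Phi.shape, A.shape)         # (64, 5) (5, 20)
```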
Testing Algorithm
Advantages
Disadvantages
General Idea
LDA operates using two values: between class scatter and within
class scatter. Between class scatter is concerned with the distance
between different class clusters, whereas within class scatter refers to
the distance between points of a class. LDA maximizes the between-
class scatter and minimizes the within-class scatter.
µ i = E X |Y [ X | Y = i ]
Let us also define a variable Σi that represents the covariance
matrix of a class:
Σi = EX |Y [( X − µi )( X − µi ) T |Y = i ]
Using these values, we can define the between-class and within-class scatter for the two classes as
S_B = (µ_1 − µ_0)(µ_1 − µ_0)^T,   S_W = Σ_1 + Σ_0
Maximizing the between-class scatter relative to the within-class scatter with a Lagrange multiplier λ and setting the gradient with respect to w to zero gives:
∇_w L = 2(S_B − λ S_W)w = 0
Using this equation, we get that the critical points are located at:
S_B w = λ S_W w
This is a generalized eigenvector problem. In the case where S_W^{−1} = (Σ_1 + Σ_0)^{−1} exists, we obtain:
S_W^{−1} S_B w = λ w
We can then plug in our definition of S_B to get:
S_W^{−1} (µ_1 − µ_0)(µ_1 − µ_0)^T w = λ w
Since (µ_1 − µ_0)^T w is just a scalar, which we denote α, this reduces to:
S_W^{−1} (µ_1 − µ_0) = (λ/α) w
The magnitude of w does not matter, so we can represent our projection w as:
w^* = S_W^{−1} (µ_1 − µ_0) = (Σ_1 + Σ_0)^{−1} (µ_1 − µ_0)
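As a small illustration, here is a numpy sketch of this two-class Fisher projection (the helper below is hypothetical and uses sample covariances, which only changes w^* by a scale factor):

```python
import numpy as np

def fisher_direction(X0, X1):
    """X0, X1: (N_i, d) arrays of samples from class 0 and class 1."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)   # within-class scatter (up to scale)
    w = np.linalg.solve(Sw, mu1 - mu0)                          # w* = S_W^{-1} (mu1 - mu0)
    return w / np.linalg.norm(w)                                # magnitude does not matter

# projecting data onto the discriminant direction: z = X @ fisher_direction(X0, X1)
```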
• N sample images: { x1 , · · · , x N }
• Average of all data: µ = (1/N) ∑_{k=1}^{N} x_k
Scatter Matrices:
• Scatter of class i: S_i = ∑_{x_k ∈ Y_i} (x_k − µ_i)(x_k − µ_i)^T
• Within class scatter: S_W = ∑_{i=1}^{c} S_i
• Between class scatter: S_B = ∑_{i=1}^{c} N_i (µ_i − µ)(µ_i − µ)^T
z = w^T x,  x ∈ R^m, z ∈ R^n
S_B w_i = λ_i S_W w_i,  i = 1, . . . , m
Rank(S_B) ≤ C − 1
Rank(S_W) ≤ N − C
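The generalized eigenvector problem S_B w_i = λ_i S_W w_i can be solved numerically; a hedged sketch using scipy (the small ridge added to S_W is our own safeguard, not part of the notes):

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, n_components):
    """X: (N, d) data, y: (N,) integer class labels; returns a (d, n_components) projection matrix."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                 # scatter of class c
        diff = (mu_c - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)                    # between-class scatter
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))           # generalized eigenproblem S_B w = lambda S_W w
    order = np.argsort(vals)[::-1]                         # largest eigenvalues first
    return vecs[:, order[:n_components]]                   # at most C - 1 useful directions
```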
Introduction
In this chapter, we apply the "Bag of Words" model to images and use it for prediction tasks such as image classification and face detection.
There are two main steps for the "Bag of Words" method when
applied to computer vision, and these will further be explored in the
Outline section below.
1. Build a "dictionary" or "vocabulary" of features across many
images - what kinds of common features exist in images? We can
consider, for example, color scheme of the room, parts of faces such
as eyes, and different types of objects.
2. Given new images, represent them as histograms of the features
we had collected - frequencies of the visual "words" in the vocabulary
we have built.
Origins
The origins of applying the "Bag of Words" model to images come from Texture Recognition and, as previously mentioned, Document Representation.
1. Textures consist of repeated elements, called textons - for ex-
ample, a net consists of repeated holes and a brick wall consists of
repeated brick pieces. If we were to consider each texton a feature,
then each image could be represented as a histogram across these
features - where the texton in the texture of the image would have
high frequency in the histogram. Images with multiple textures,
therefore, can be represented by histograms with high values for
multiple features.
2. Documents consist of words which can be considered their
features. Thus, every document is represented by a histogram across
the words in the dictionary - one would expect, for example, the
document of George Bush’s state of the union address in 2001 to
contain high relative frequencies for "economy", "Iraq", "army", etc.
Thus, a "bag of words" can be viewed as a histogram representing
frequencies across a vocabulary developed over a set of images or
documents - new data then can be represented with this model and
used for prediction tasks.
Algorithm Summary
First, we find common features across our dataset of images. We can choose any type of feature we want. For example, we can simply split our images into a grid and grab the subimages as features (shown below). Or, we can use corner detection or SIFT features as our features.
Once we have our features, we must turn this large feature set into a
small set of "themes". These "themes" are analogous to the "words"
in the Natural Language Processing version of the algorithm. As
mentioned above, in the Computer Vision application, the "words"
are called textons.
To find textons, we simply cluster our features. We can use any
clustering technique (K-Means is most common, but Mean Shift or
HAC may also work) to cluster the features. We then use the centers
of each cluster as the textons. Our set of textons is known as a visual
vocabulary. An example of a visual vocabulary is given below.
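A possible sketch of this clustering step, assuming scikit-learn's KMeans and a stack of local descriptors already extracted from the training images (names and the vocabulary size are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, vocab_size=200):
    """descriptors: (M, d) local features pooled from many training images."""
    return KMeans(n_clusters=vocab_size, n_init=10, random_state=0).fit(descriptors)

def bow_histogram(image_descriptors, kmeans):
    """Represent one image as a normalized histogram of texton frequencies."""
    words = kmeans.predict(image_descriptors)                   # nearest texton for each feature
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```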
Representing our images as a histogram of texton frequencies38
38 Ranjay Krishna and Juan Carlos Niebles.
The texton frequencies can additionally be weighted by the inverse document frequency (IDF) of each visual word, which down-weights textons that appear in many images:
IDF = log( NumDocs / NumDocs_{j appears} )
Motivation
Pyramids
If the BoWs of the upper part of the image contain "sky visual words", the BoWs in the middle "vegetation and mountains visual words", and the BoWs at the bottom "mountains visual words", then it is very likely that the image scene category is "mountains".
(See: How should we understand the Spatial Pyramid Matching? https://www.quora.com/How-should-we-understand-the-Spatial-Pyramid-Match [Online; accessed 15-Nov-2017])
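A rough sketch of how such spatially separated histograms could be computed, assuming we already know each feature's image location and its assigned visual word (the grid levels and all names below are our own choices):

```python
import numpy as np

def spatial_pyramid_histogram(xy, words, image_size, vocab_size, levels=2):
    """xy: (M, 2) feature locations (x, y); words: (M,) visual-word index of each feature."""
    H, W = image_size
    hists = []
    for level in range(levels + 1):
        cells = 2 ** level                                   # 1x1, 2x2, 4x4, ... grid
        cx = np.minimum((xy[:, 0] * cells / W).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells / H).astype(int), cells - 1)
        for gy in range(cells):
            for gx in range(cells):
                in_cell = (cx == gx) & (cy == gy)
                hists.append(np.bincount(words[in_cell], minlength=vocab_size).astype(float))
    hist = np.concatenate(hists)                             # concatenated per-cell BoW histograms
    return hist / max(hist.sum(), 1.0)
```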
Some results
Naive Bayes
Basic Idea
Prior
P(c) denotes the probability of encountering one object class versus others. For all m object classes, we then have
∑_{i=1}^{m} P(c_i) = 1
Posterior
Using the prior equation, we can now calculate the probability that the image represented by histogram x belongs to class category c using Bayes' Theorem:
P(c|x) = \frac{P(c) P(x|c)}{\sum_{c'} P(c') P(x|c')}
Expanding the numerator and denominator, we can rewrite the previous equation as
P(c|x) = \frac{P(c) \prod_{i=1}^{m} P(x_i|c)}{\sum_{c'} P(c') \prod_{i=1}^{m} P(x_i|c')}
Classification
In order to classify the image represented by histogram x, we simply
find the class c∗ that maximizes the previous equation:
c∗ = argmax c P(c| x )
Equivalently, since the logarithm is monotonic, we can maximize the log-posterior:
c∗ = argmax_c log P(c|x)
For two classes c_1 and c_2, we have
P(c_1|x) = \frac{P(c_1) \prod_{i=1}^{m} P(x_i|c_1)}{\sum_{c'} P(c') \prod_{i=1}^{m} P(x_i|c')}
and
P(c_2|x) = \frac{P(c_2) \prod_{i=1}^{m} P(x_i|c_2)}{\sum_{c'} P(c') \prod_{i=1}^{m} P(x_i|c')}
Since the denominators are identical, we can ignore them when calculating the maximum. Thus
P(c_1|x) ∝ P(c_1) \prod_{i=1}^{m} P(x_i|c_1)
and
P(c_2|x) ∝ P(c_2) \prod_{i=1}^{m} P(x_i|c_2)
c∗ = argmax c P(c| x )
c∗ = argmax c logP(c| x )
c∗ = argmax_c [ log P(c) + ∑_{i=1}^{m} log P(x_i|c) ]
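A hedged numpy sketch of this classifier, using the multinomial form in which each histogram count weights the corresponding log-likelihood (the Laplace smoothing constant eps is our own addition to avoid taking the log of zero):

```python
import numpy as np

def train_naive_bayes(histograms, labels, n_classes, eps=1.0):
    """histograms: (N, m) visual-word counts per training image; labels: (N,) class ids."""
    log_priors = np.zeros(n_classes)
    log_likelihoods = np.zeros((n_classes, histograms.shape[1]))
    for c in range(n_classes):
        Xc = histograms[labels == c]
        log_priors[c] = np.log(len(Xc) / len(histograms))       # log P(c)
        counts = Xc.sum(axis=0) + eps                            # Laplace smoothing (assumed)
        log_likelihoods[c] = np.log(counts / counts.sum())       # log P(word_i | c)
    return log_priors, log_likelihoods

def classify(x, log_priors, log_likelihoods):
    """x: (m,) histogram of a new image; returns argmax_c of log P(c) + sum_i x_i log P(word_i|c)."""
    return int(np.argmax(log_priors + log_likelihoods @ x))
```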
Object Detection from Deformable Parts
Challenges
Object detection, however, faces many challenges. The challenges in-
clude the varying illumination conditions, changes in the viewpoints,
object deformations, and intra-class variability; this makes objects of
the same category appear different and makes it difficult to correctly
detect and classify objects. In addition, the algorithms introduced
herein only give the 2D location of the object in the image and not its 3D location.
PASCAL VOC
The first widely used benchmark was the PASCAL VOC Challenge42, or the Pattern Analysis, Statistical Modeling, and Computational Learning Visual Object Classes challenge. The PASCAL VOC challenge was used from 2005 to 2012 and tested 20 categories. PASCAL was regarded as a high quality benchmark because its test categories had high variability within each category. Each test image also had bounding boxes for all objects of interest like cars, people, cats, etc. PASCAL also had annual classification, detection, and segmentation challenges.
42 M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.h
2. False Positive (FP)
False positives are predicted boxes whose overlap with the ground truth is less than 0.5 (Figure 68b). False positives are also referred to as false alarms.
3. False Negative (FN)
False negatives are ground truth objects that our model does not
find (Figure 68b). These can also be referred to as misses.
4. True Negative (TN)
True negatives are anywhere our algorithm didn't produce a box and the annotator did not provide a box. True negatives are also called correct rejections.
Precision = TP / (TP + FP)
Precision can be thought of as the fraction of the model's object predictions that are correct.
Recall = TP / (TP + FN)
Recall can be thought of as the fraction of ground truth objects that are correctly detected by the model.
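For concreteness, a tiny helper computing both quantities from the detection counts (purely illustrative):

```python
def precision_recall(tp, fp, fn):
    """Detection precision and recall from counts of true/false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# e.g. precision_recall(tp=7, fp=3, fn=2) -> (0.7, 0.777...)
```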
Classifying Windows
Objects of the same class can vary in shape (e.g., the car may be longer), so we want a new detection model that can handle these situations. Recall the bag of words approach, in which
we represent a paragraph as a set of words, or an image as a set of
image parts. We can apply a similar idea here and detect an object
by its parts instead of detecting the whole singular object. Even if
the shape is slightly altered, all of the parts will be present and in
approximately the correct position with some minor variance.
( F0 , P1 , P2 , ...Pn , b)
where F0 is the root filter, P1 is the model for the first part, and b is a
bias term. Breaking it down further, each part’s model Pi is defined
by a tuple
( Fi , vi , di )
where Fi is the filter for the i-th part, vi is the "anchor" position for
part i relative to the root position, and di defines the deformation
cost for each possible placement of the part relative to the anchor
position.
We can calculate the location of the global filter and each part filter
with a HOG pyramid (see Figure 82). We run the global HOG filter
and each part’s HOG filter over the image at multiple scales so that
the model is robust to changes in scale. The location of each filter is
the location where we see the strongest response. Since we are taking
the location of responses across multiple scales we have to take care
that our description of the location of each part is scale-invariant
(one way this can be done is by scaling the maximum response map
for each part up to the original image size and then taking scale-
invariant location).
Recall that the score for each filter is the inner product of the filter
(as a vector) and φ( pi , H ) (defined as the HOG feature vector of a
window defined by position pi of the filter). Note that the windows
can be visualized in the HOG pyramid in Figure 82: the window for
the root is the cyan bounding box, and the window for each of the
parts is the yellow bounding box corresponding to that part. We are
taking the HOG feature vector of the portion of the image enclosed in
these bounding boxes and seeing how well it matches with the HOG
features of the template for that part.
Returning to the score formula, the right term represents the sum
of the deformation penalties for each part. We have di representing
the weights of each penalty for part i, corresponding to quantities
dxi (the distance in x direction from the anchor point where the part
should be), dyi (the distance in y direction from the anchor point
where the part should be), as well as dxi2 and dy2i . As an example, if
di = (0, 0, 1, 0), then the deformation penalty for part i is the square
of the distance in the x direction of that part from the anchor point.
All other measures of distance are ignored.
∑_{i=0}^{n} F_i · φ(p_i, H)
3. Having applied the parts filter, we now calculate the spatial costs (i.e., a measure of the deformation of the parts with respect to the global filter):
∑_{i=1}^{n} d_i · (dx_i, dy_i, dx_i², dy_i²)
The total score for a placement is then
F_0 · φ(p_0, H) + ∑_{i=1}^{n} F_i · φ(p_i, H) − ∑_{i=1}^{n} d_i · (dx_i, dy_i, dx_i², dy_i²)
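A minimal sketch of evaluating this score for one hypothesized placement of the root and part filters, assuming the HOG filter responses have already been computed as response maps (all names are hypothetical):

```python
import numpy as np

def dpm_score(responses, placements, anchors, deform_weights):
    """
    responses:      list of n+1 response maps, responses[i][y, x] = F_i . phi(p_i, H)
    placements:     list of n+1 (x, y) locations for the root (index 0) and each part
    anchors:        list of n anchor offsets v_i of each part relative to the root
    deform_weights: list of n 4-vectors d_i weighting (dx, dy, dx^2, dy^2)
    """
    x0, y0 = placements[0]
    score = responses[0][y0, x0]                              # root filter response
    for i in range(1, len(placements)):
        xi, yi = placements[i]
        score += responses[i][yi, xi]                         # part filter response
        dx = xi - (x0 + anchors[i - 1][0])                    # displacement from the anchor
        dy = yi - (y0 + anchors[i - 1][1])
        score -= np.dot(deform_weights[i - 1], [dx, dy, dx**2, dy**2])   # deformation penalty
    return score
```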
In the top image, DPM assumes that the detected parts are related to each other because they are spatially close (i.e., they fit the deformation model), and that they correspond to the car identified as the global filter, when in reality there is not one but two cars in the image providing the parts.
Similarly, with the bottom image, DPM indeed detects an object
very close to a car. However, since it does not take into account that
the object is even closer to being a bus than a car, and does not take
into account features explicitly not present in a car (e.g., the raised
roof spelling "School bus"), DPM results in a wrong detection.
DPM Summary
Approach
each part
Advantages
Disadvantages
Introduction
Since models are limited by the number of the classes in the training
set, one possible method of advancing universal class recognition is
to create datasets with many more than the 20 classes the PASCAL
VOC provided.
ImageNet is a large image database created by Jia Deng in 2009
that has 22,000 categories and 14 million images.
We see from this matrix that models are more prone to making
classification errors when presented with items whose categories are
very similar; for example, models struggle with discerning between
different types of aquatic birds, while they can more easily categorize
birds versus dogs. In other words, distinguishing between “fine-
grained” categories is very difficult, since the distance between those
categories is much smaller than between larger categories.
Semantic Hierarchy
One method for solving the issue of correctly classifying similar
classes without making wrong guesses is the idea of a "semantic hierarchy".
1. Pick a λ.
Fine-grained Classes
Existing work selects features from all possible locations in an image,
but it can fail to find the right feature. For example, the difference
between the Cardigan Welsh Corgi and the Pembroke Welsh Corgi
is the tail. The computer may be unable to discover that this is the
distinguishing feature. The easiest way to solve this type of problem
is through crowd-sourcing.
What seems like a simple solution, however, is actually difficult
– what is the best method for asking a crowd which features distin-
guish classes of images?
Bubble study: In order to detect the difference between smiling
and neutral faces, we display only small bubbles of an image to
people in an experiment to determine which bubbles allow them
to detect when the image is of a person smiling. An example of
how the bubble method works can be seen in figure 5. However,
this method is costly and time consuming. Another idea is to ask
people to annotate images themselves (to simply point out where the distinguishing features are).
Introduction
Optical Flow
Put simply, optical flow is the movement of pixels over time. The
goal of optical flow is to generate a motion vector for each pixel
in an image between t0 and t1 by looking at two images I0 and I1 .
By computing a motion vector field between each successive frame
in a video, we can track the flow of objects, or, more accurately,
"brightness patterns" over extended periods of time. However, it
is important to note that while optical flow aims to represent the
motion of image patterns, it is limited to representing the apparent
motion of these patterns. This nuanced difference is explained more
in depth in the Assumptions and Limitations section.
Small Motion Optical flow assumes that points do not move very
far between consecutive images. This is often a safe assumption, as
videos are typically comprised of 20+ frames per second, so motion
between individual frames is small. However, in cases where the
object is very fast or close to the camera this assumption can still
prove to be untrue. To understand why this assumption is necessary,
we must consider the Brightness Consistency equation defined
above. When trying to solve this equation, it is useful to linearize the
right side using a Taylor expansion. This yields
I(x + u(x, y), y + v(x, y), t) ≈ I(x, y, t − 1) + Ix · u(x, y) + Iy · v(x, y) + It
Rearranging,
I(x + u(x, y), y + v(x, y), t) − I(x, y, t − 1) ≈ Ix · u(x, y) + Iy · v(x, y) + It
By the brightness constancy assumption the left-hand side is zero, giving us
Ix · u + Iy · v + It ≈ 0
∇I · [u v]^T + It = 0
Ignoring the meaning of this derivation for the moment, it is clear
that we do not have enough equations to find both u and v at every
single pixel. Assuming that pixels move together allows us to use
many more equations with the same [u v], making it possible to solve
for the motion of pixels in this neighborhood.
Lucas-Kanade
0 = It (pi ) + ∇ I (pi ) · [u v]
A = \begin{bmatrix} I_x(p_1) & I_y(p_1) \\ I_x(p_2) & I_y(p_2) \\ \vdots & \vdots \\ I_x(p_{25}) & I_y(p_{25}) \end{bmatrix}
This produces an overly-constrained system of linear equations of
the form Ad = b. Using a least squares method for solving over-
constrained systems, we reduce the problem to solving for d in
( A T A)d = A T b. More explicitly the system to solve is reduced to
" #" # " #
∑ Ix Ix ∑ Ix Iy u ∑ Ix It
=−
∑ Iy Ix ∑ Iy Iy v ∑ Iy It
AT A AT b
• A T A should be invertible
• A T A should be well-conditioned
i.e., λ1/λ2 should not be too large (for λ1 > λ2)
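A small numpy sketch that solves (A^T A)d = A^T b for one window and rejects flat or edge-like regions using the eigenvalue conditions above (the thresholds are illustrative, not from the notes):

```python
import numpy as np

def lk_flow_for_window(Ix, Iy, It, cond_thresh=100.0):
    """Ix, Iy, It: spatial and temporal derivatives over one small window (e.g. 5x5 arrays)."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)       # one row [Ix(p), Iy(p)] per pixel
    b = -It.ravel()
    M = A.T @ A                                          # second moment (Harris) matrix
    lam = np.linalg.eigvalsh(M)                          # lam[0] <= lam[1]
    if lam[0] < 1e-6 or lam[1] / lam[0] > cond_thresh:
        return None                                      # flat or edge-like region: unreliable
    u, v = np.linalg.solve(M, A.T @ b)                   # least-squares solution
    return u, v
```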
Geometric Interpretation
It should be evident that the least squares system of equations above
produce a second moment matrix M = A T A. In fact, this is the
Harris matrix for corner detection.
" # " #
∑ Ix Ix ∑ Ix Iy I h i
T
A A= = ∑ x Ix Iy = ∑ ∇ I (∇ I )T = M
∑ Iy Ix ∑ Iy Iy Iy
We can relate the conditions above for solving the motion field [u v]
to tracking corners detected by the Harris matrix M. In particular, the
eigenvectors and eigenvalues of M = A T A relate to the direction and
magnitude of a possible edge in a region.
Using this interpretation, it is apparent that an ideal region for
Lucas-Kanade optical flow estimation is a corner. Visually, if λ1 and
λ2 are too small, this means the region is too "flat". If λ1 ≫ λ2,
the method suffers from the aperture problem, and may fail to solve
for correct optical flow.
Error in Lucas-Kanade
The Lucas-Kanade method is constrained under the assumptions of
optical flow. Supposing that A T A is easily invertible and that there is
not much noise in the image, errors may still arise when:
• The motion is not small or does not change gradually over time.
Improving Accuracy
From the many assumptions made above, Lucas-Kanade can improve
its accuracy by including the higher order terms previously dropped
in the Taylor expansion approximation for the brightness constancy
equation. This loosens the assumptions of small motion and more
accurately reflects optical flow. Now, the problem to be solved is:
Horn-Schunck
The first term of this energy function reflects the brightness constancy
assumption, which states that the brightness of each pixel remains
the same between frames, though the location of the pixel may
change. According to this assumption, Ix u + Iy v + It should be
zero. The square of this value is included in the energy function to
ensure that this value is as close to zero as possible, and thus u and v
comply with the brightness constancy assumption.
The second term of this energy function reflects the small motion
assumption, which states that the points move by small amounts be-
tween frames. The squares of the magnitudes of u and v are included
in the energy function to encourage smoother flow with only small
changes to the position of each point. The regularization constant α
is included to control smoothness, with larger values of α leading to
smoother flow.
To minimize the energy function, we take the derivative with
respect to u and v and set to zero. This yields the following two
equations
Ix ( Ix u + Iy v + It ) − α2 ∆u = 0
Iy ( Ix u + Iy v + It ) − α2 ∆v = 0
where ∆ = ∂²/∂x² + ∂²/∂y² is the Laplace operator, which in practice is computed as
∆u(x, y) = ū(x, y) − u(x, y)
where ū(x, y) is the weighted average of u in a neighborhood around (x, y). Substituting this expression for the Laplacian in the two
equations above yields
( Ix2 + α2 )u + Ix Iy v = α2 ū − Ix It
Ix Iy u + ( Iy2 + α2 )v = α2 v̄ − Iy It
which gives a linear system in u and v for each pixel.
Iterative Horn-Schunck
Since the solution for u and v at each pixel (x, y) depends on the optical flow values in a neighborhood around (x, y), we obtain accurate results by iteratively recomputing the flow:
u^{k+1} = ū^k − Ix (Ix ū^k + Iy v̄^k + It) / (α² + Ix² + Iy²)
v^{k+1} = v̄^k − Iy (Ix ū^k + Iy v̄^k + It) / (α² + Ix² + Iy²)
where ūk and v̄k are the values for ū and v̄ calculated during the k’th
iteration, and uk+1 and vk+1 are the updated values for u and v for
the next iteration.
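These updates are straightforward to implement; below is a hedged numpy/scipy sketch, where the weighted average ū, v̄ is computed with a small convolution kernel (the kernel weights and iteration count are our own choices):

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=100):
    """Ix, Iy, It: derivative images; returns dense flow fields u, v."""
    u = np.zeros_like(Ix, dtype=float)
    v = np.zeros_like(Ix, dtype=float)
    kernel = np.array([[1/12, 1/6, 1/12],      # weights for the neighborhood average
                       [1/6,  0.0, 1/6 ],
                       [1/12, 1/6, 1/12]])
    denom = alpha**2 + Ix**2 + Iy**2
    for _ in range(n_iters):
        u_bar, v_bar = convolve(u, kernel), convolve(v, kernel)
        t = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * t                      # iterative update for u
        v = v_bar - Iy * t                      # iterative update for v
    return u, v
```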
Smoothness Regularization
The smoothness regularization term ||∇u||2 + ||∇v||2 in the energy
function encourages minimizing change in optical flow between
nearby points. With this regularization term, in texture free regions
there is no optical flow, and on edges, points will flow to the nearest
points, solving the aperture problem.
Now, when we try to find the flow vector, the small motion condi-
tion is fulfilled, as the downsampled pixels move less between consecutive frames than pixels in the higher resolution image. Here is
another example from the slides using Lucas-Kanade with pyramids:
Notice how the flow vectors now point mostly in the same direc-
tion, indicating that the tree trunk is moving in a consistent direction.
Common Fate
idea that each pixel in a given segment of the image will move in
a similar manner. Our goal is to identify the image segments, or
"layers", that move together.
Identify Layers
We compute layers in an image by dividing the image into blocks and
grouping based on the similarity of their affine motion parameters.
For each block, we find the parameter vector a that minimizes the deviation from the brightness constancy constraint
Ix u(x, y) + Iy v(x, y) + It ≈ 0
where the motion within the block is modeled as affine:
u(x, y) = a_1 + a_2 x + a_3 y
v(x, y) = a_4 + a_5 x + a_6 y
From there, we map our parameter vectors ai into motion param-
eter space and perform k-means clustering on the affine motion
parameter vectors.
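One way this could be sketched in numpy: estimate a least-squares affine parameter vector per block from the brightness constancy constraint, then cluster the vectors (the block size and the use of scikit-learn's KMeans are our own assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def block_affine_params(Ix, Iy, It, block=16):
    """Estimate one affine motion parameter vector a = (a1, ..., a6) per image block."""
    H, W = Ix.shape
    params = []
    for y0 in range(0, H - block + 1, block):
        for x0 in range(0, W - block + 1, block):
            sl = np.s_[y0:y0 + block, x0:x0 + block]
            ys, xs = np.mgrid[sl]
            ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
            x, y = xs.ravel().astype(float), ys.ravel().astype(float)
            # columns multiply (a1, a2, a3, a4, a5, a6); minimize ||G a + it||^2
            G = np.stack([ix, ix * x, ix * y, iy, iy * x, iy * y], axis=1)
            a, *_ = np.linalg.lstsq(G, -it, rcond=None)
            params.append(a)
    return np.array(params)

# group blocks with similar affine motion into layers (number of layers assumed):
# labels = KMeans(n_clusters=3, n_init=10).fit_predict(block_affine_params(Ix, Iy, It))
```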
Definition
Visual tracking is the process of locating a moving object (or multiple
objects) over time in a sequence.
Objective
The objective of tracking is to associate target objects and estimate
target state over time in consecutive video frames.
Applications
Tracking has a variety of applications, some of which are:
Feature Tracking
Definition
Feature tracking is the detection of visual feature points (corners,
textured areas, ...) and tracking them over a sequence of frames
(images).
Example
Tracking methods
Simple Kanade–Lucas–Tomasi feature tracker  The Kanade–Lucas–Tomasi (KLT) feature tracker is an approach to feature extraction. KLT makes use of spatial intensity information to direct the search for the position that yields the best match. Its algorithm is:
• If the patch around the new point differs sufficiently from the
old point, we discard these points.
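In practice, a KLT-style tracker is often assembled from off-the-shelf pieces; for example, a hedged OpenCV sketch (parameter values are illustrative, and grayscale uint8 frames are assumed):

```python
import cv2

def klt_track(prev_gray, next_gray):
    """Detect good features in the first frame and track them into the next with pyramidal LK."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)
    next_pts, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, pts, None,
        winSize=(15, 15), maxLevel=3)            # maxLevel = number of extra pyramid levels
    ok = status.ravel() == 1                     # drop points whose patches could not be matched
    return pts[ok], next_pts[ok]
```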
2D Transformations
Types of 2D Transformations
There are several types of 2D transformations. Choosing the correct
2D transformations can depend on the camera (e.g. placement, move-
ment, and viewpoint) and objects. A number of 2D transformations
are shown in Figure . Examples of 2D transformations include:
Translation
Translational motion is the motion by which a body shifts from one
point in space to another. Assume we have a simple point m with
coordinates ( x, y). Applying a translation motion on m shifts it from
( x, y) to ( x 0 , y0 ) where
x′ = x + b_1
y′ = y + b_2        (17)
We can write this as a matrix transformation using homogeneous
coordinates:
\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} 1 & 0 & b_1 \\ 0 & 1 & b_2 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}        (18)
Let W be the above transformation defined as:
W(x; p) = \begin{pmatrix} 1 & 0 & b_1 \\ 0 & 1 & b_2 \end{pmatrix}        (19)
where the parameter vector is p = (b_1, b_2)^T.
Similarity Motion
x′ = a x + b_1
y′ = a y + b_2        (21)
Affine motion
Affine motion includes scaling, rotation, and translation. We can
express this as the following:
x′ = a_1 x + a_2 y + b_1
y′ = a_3 x + a_4 y + b_2        (24)
Problem formulation
Given a video sequence, find the sequence of transforms that maps
each frame to the next frame. The method should be able to deal with arbitrary types of motion, including object motion and camera/perspective motion.
Approach
This approach differs from the simple KLT tracker by the way it links
frames: instead of using optical-flow to link motion vectors and track
motion, we directly solve for the relevant transforms using feature
data and linear approximations. This allows us to deal with more
complex (such as affine and projective) transforms and link objects
more robustly.
Steps:
3. Solve for the transform p that minimizes the error of the feature
description around x2 = W ( x; p) (your hypothesis for where the
feature's new location is) in the next frame. In other words, find the p that minimizes
∑_x [ T(W(x; p)) − T(x) ]²
Math
We can in fact analytically derive an approximation method for
finding p (in Step 3). Assume that you have an initial guess for p, p0 ,
and p = p0 + ∆p.
Now, the error to be minimized is
E = ∑_x [ T(W(x; p_0 + ∆p)) − T(x) ]²
But using the Taylor approximation, we see that this error term is roughly equal to:
E ≈ ∑_x [ T(W(x; p_0)) + ∇T (∂W/∂p) ∆p − T(x) ]²
To minimize this term, we take the derivative with respect to ∆p, set it equal to 0, and then solve for ∆p:
∂E/∂∆p ≈ ∑_x [ ∇T (∂W/∂p) ]^T [ T(W(x; p_0)) + ∇T (∂W/∂p) ∆p − T(x) ] = 0
∆p = H^{−1} ∑_x [ ∇T (∂W/∂p) ]^T [ T(x) − T(W(x; p_0)) ]
where H = ∑_x [ ∇T (∂W/∂p) ]^T [ ∇T (∂W/∂p) ].
By iteratively setting p0 = p0 + ∆p, we can eventually converge
on an accurate, error-minimizing value of p, which tells us what the
transform is.
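As an illustration of this iteration, here is a numpy/scipy sketch for the simplest case of a pure translational warp W(x; p) = x + p, so that ∂W/∂p is the identity (all helper names, the bilinear sampling via map_coordinates, and the stopping rule are our own choices, not from the notes):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def align_translation(next_frame, template, p0=(0.0, 0.0), n_iters=20):
    """Estimate the translation p that maps the template patch into the next frame."""
    h, w = template.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    gy, gx = np.gradient(template)                      # gradient of the template
    p = np.array(p0, dtype=float)
    for _ in range(n_iters):
        # sample the next frame at the warped coordinates W(x; p) = x + p
        warped = map_coordinates(next_frame, [ys + p[1], xs + p[0]], order=1)
        error = (template - warped).ravel()             # T(x) - T(W(x; p0))
        J = np.stack([gx.ravel(), gy.ravel()], axis=1)  # grad T * dW/dp (identity for translation)
        H = J.T @ J
        dp = np.linalg.solve(H, J.T @ error)            # delta p = H^{-1} sum [...]^T [...]
        p += dp
        if np.linalg.norm(dp) < 1e-4:                   # converged
            break
    return p
```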