ML Week 4 To 10 PDF
Non-linear Hypotheses
Performing linear regression with a complex set of data with many features is very unwieldy. Say you
wanted to create a hypothesis from three (3) features that included all the quadratic terms:
$$g(\theta_0 + \theta_1 x_1^2 + \theta_2 x_1 x_2 + \theta_3 x_1 x_3 + \theta_4 x_2^2 + \theta_5 x_2 x_3 + \theta_6 x_3^2)$$
That gives us 6 features. The exact way to calculate how many features all the polynomial terms produce is the combination function with repetition (http://www.mathsisfun.com/combinatorics/combinations-permutations.html): $\frac{(n+r-1)!}{r!\,(n-1)!}$. In this case we are taking all two-element combinations of three features: $\frac{(3+2-1)!}{2!\,(3-1)!} = \frac{4!}{4} = 6$. (Note: you do not have to know these formulas, I just found it helpful for understanding.)
For 100 features, if we wanted to make them quadratic we would get $\frac{(100+2-1)!}{2!\,(100-1)!} = 5050$ resulting new features.
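As a quick sanity check of the formula (a sketch, not part of the original notes), Octave's built-in nchoosek counts these combinations with repetition:

% Number of two-element combinations with repetition of n features
n = 100;
r = 2;
numQuadraticTerms = nchoosek(n + r - 1, r)   % prints 5050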
We can approximate the growth of the number of new features we get with all quadratic terms with $O(n^2/2)$. And if you wanted to include all cubic terms in your hypothesis, the features would grow asymptotically at $O(n^3)$. These are very steep growths, so as the number of features increases, the number of quadratic or cubic features grows very rapidly and quickly becomes impractical.
Example: let our training set be a collection of 50 x 50 pixel black-and-white photographs, and our goal will be to classify which ones are photos of cars. Our feature set size is then n = 2500, since each pixel intensity is a feature.

Now let's say we need to make a quadratic hypothesis function. With quadratic features, our growth is $O(n^2/2)$, so our total number of features will be about $2500^2/2 = 3{,}125{,}000$, which is very impractical.
Neural networks offer an alternate way to perform machine learning when we have complex hypotheses with many features.
Neurons and the Brain
Neural networks are limited imitations of how our own brains work. They've had a big recent
resurgence because of advances in computer hardware.
There is evidence that the brain uses only one "learning algorithm" for all its different functions.
Scientists have tried cutting (in an animal brain) the connection between the ears and the auditory
cortex and rewiring the optical nerve with the auditory cortex to find that the auditory cortex literally
learns to see.
This principle is called "neuroplasticity" and has many examples and experimental evidence.
Model Representation I
Let's examine how we will represent a hypothesis function using neural networks.
At a very simple level, neurons are basically computational units that take input (dendrites) as electrical
input (called "spikes") that are channeled to outputs (axons).
In our model, our dendrites are like the input features x1 ⋯ xn , and the output is the result of our
hypothesis function:
In this model our x0 input node is sometimes called the "bias unit." It is always equal to 1.
In neural networks, we use the same logistic function as in classification: $\frac{1}{1+e^{-\theta^T x}}$. In neural networks, however, we sometimes call it a sigmoid (logistic) activation function.
Our "theta" parameters are sometimes instead called "weights" in the neural networks model.
Our input nodes (layer 1) go into another node (layer 2), and are output as the hypothesis function.
The first layer is called the "input layer" and the final layer the "output layer," which gives the final value
computed on the hypothesis.
We can have intermediate layers of nodes between the input and output layers called the "hidden
layer."
We label these intermediate or "hidden" layer nodes $a_0^{(2)} \cdots a_n^{(2)}$ and call them "activation units."
$a_i^{(j)}$ = "activation" of unit i in layer j
$\Theta^{(j)}$ = matrix of weights controlling function mapping from layer j to layer j + 1
If we had one hidden layer, it would look visually something like:
$$[x_0\ x_1\ x_2\ x_3] \rightarrow [a_1^{(2)}\ a_2^{(2)}\ a_3^{(2)}] \rightarrow h_\theta(x)$$
This is saying that we compute our activation nodes by using a 3×4 matrix of parameters. We apply
each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis
output is the logistic function applied to the sum of the values of our activation nodes, which have been
multiplied by yet another parameter matrix Θ(2) containing the weights for our second layer of nodes.
The +1 comes from the addition in $\Theta^{(j)}$ of the "bias nodes," $x_0$ and $\Theta_0^{(j)}$. In other words, the output nodes will not include the bias nodes while the inputs will.
Example: layer 1 has 2 input nodes and layer 2 has 4 activation nodes. The dimension of $\Theta^{(1)}$ is going to be 4×3, where $s_j = 2$ and $s_{j+1} = 4$, so $s_{j+1} \times (s_j + 1) = 4 \times 3$.
Model Representation II
In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable $z_k^{(j)}$ that encompasses the parameters inside our g function. In our previous example, if we replaced the variable z for all the parameters we would get:

$a_1^{(2)} = g(z_1^{(2)})$
$a_2^{(2)} = g(z_2^{(2)})$
$a_3^{(2)} = g(z_3^{(2)})$
In other words, for layer j = 2 and node k, the variable z will be: $z_k^{(2)} = \Theta_{k,0}^{(1)} x_0 + \Theta_{k,1}^{(1)} x_1 + \cdots + \Theta_{k,n}^{(1)} x_n$
We are multiplying our matrix $\Theta^{(j-1)}$ with dimensions $s_j \times (n+1)$ (where $s_j$ is the number of our activation nodes) by our vector $a^{(j-1)}$ with height (n+1). This gives us our vector $z^{(j)}$ with height $s_j$.
Now we can get a vector of our activation nodes for layer j as follows: $a^{(j)} = g(z^{(j)})$
We can then add a bias unit (equal to 1) to layer j after we have computed $a^{(j)}$. This will be element $a_0^{(j)}$ and will be equal to 1.
We get this final z vector by multiplying the next theta matrix after Θ(j−1) with the values of all the
activation nodes we just got.
This last theta matrix Θ(j) will have only one row so that our result is a single number.
Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we
did in logistic regression.
Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting
and more complex non-linear hypotheses.
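As a rough Octave sketch of the vectorized forward propagation just described (assuming weight matrices Theta1 and Theta2 and a single input column vector x; these names are illustrative, not the lecture's code):

g = @(z) 1 ./ (1 + exp(-z));   % sigmoid (logistic) activation function

a1 = [1; x];                   % input layer with bias unit x0 = 1
z2 = Theta1 * a1;              % z(2) = Theta(1) * a(1)
a2 = [1; g(z2)];               % hidden layer activations plus bias unit
z3 = Theta2 * a2;              % z(3) = Theta(2) * a(2)
h  = g(z3);                    % hypothesis output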
A simple example of applying neural networks is predicting x1 AND x2. With weights $\Theta^{(1)} = [-30\ 20\ 20]$ (listed again below), the hypothesis is $h_\Theta(x) = g(-30 + 20x_1 + 20x_2)$. This will cause the output of our hypothesis to only be positive if both x1 and x2 are 1. In other words:
$x_1 = 0$ and $x_2 = 0$, then $g(-30) \approx 0$
$x_1 = 0$ and $x_2 = 1$, then $g(-10) \approx 0$
$x_1 = 1$ and $x_2 = 0$, then $g(-10) \approx 0$
$x_1 = 1$ and $x_2 = 1$, then $g(10) \approx 1$
So we have constructed one of the fundamental operations in computers by using a small neural
network rather than using an actual AND gate. Neural networks can also be used to simulate all the
other logical gates.
AND: $\Theta^{(1)} = [-30\ 20\ 20]$
NOR: $\Theta^{(1)} = [10\ -20\ -20]$
OR: $\Theta^{(1)} = [-10\ 20\ 20]$
We can combine these to get the XNOR logical operator (which gives 1 if x1 and x2 are both 0 or both
1).
$$[x_0\ x_1\ x_2] \rightarrow [a_1^{(2)}\ a_2^{(2)}] \rightarrow [a^{(3)}] \rightarrow h_\Theta(x)$$
For the transition between the first and second layer, we'll use a $\Theta^{(1)}$ matrix that combines the values for AND and NOR:

$$\Theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \\ 10 & -20 & -20 \end{bmatrix}$$

For the transition between the second and third layer, we'll use a $\Theta^{(2)}$ matrix that uses the value for OR:

$$\Theta^{(2)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}$$

The values of our nodes are then:

$a^{(2)} = g(\Theta^{(1)} \cdot x)$
$h_\Theta(x) = a^{(3)} = g(\Theta^{(2)} \cdot a^{(2)})$
And there we have the XNOR operator using one hidden layer with two nodes!
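As a small sketch (not part of the original notes), the AND, NOR, and OR weights above can be wired together in Octave and checked on all four binary inputs; the output should match the XNOR truth table:

g = @(z) 1 ./ (1 + exp(-z));        % sigmoid
Theta1 = [-30 20 20; 10 -20 -20];   % first layer: row 1 computes AND, row 2 computes NOR
Theta2 = [-10 20 20];               % second layer: OR of the two hidden units

for x1 = 0:1
  for x2 = 0:1
    a2 = [1; g(Theta1 * [1; x1; x2])];
    h  = g(Theta2 * a2);
    printf("x1=%d x2=%d  h=%.4f\n", x1, x2, h);
  end
end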
Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we
wanted to classify our data into one of four final resulting classes:
$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \rightarrow \begin{bmatrix} a_0^{(2)} \\ a_1^{(2)} \\ a_2^{(2)} \\ \vdots \end{bmatrix} \rightarrow \begin{bmatrix} a_0^{(3)} \\ a_1^{(3)} \\ a_2^{(3)} \\ \vdots \end{bmatrix} \rightarrow \cdots \rightarrow \begin{bmatrix} h_\Theta(x)_1 \\ h_\Theta(x)_2 \\ h_\Theta(x)_3 \\ h_\Theta(x)_4 \end{bmatrix}$$
Our final layer of nodes, when multiplied by its theta matrix, will result in another vector, on which we
will apply the g() logistic function to get a vector of hypothesis values.
Our resulting hypothesis for one set of inputs may look like:
$h_\Theta(x) = [0\ 0\ 1\ 0]$
In which case our resulting class is the third one down, or hΘ (x)3 .
Our final value of our hypothesis for a set of inputs will be one of the elements in y.
Cost Function
Let's first define a few variables that we will need to use:
L = total number of layers in the network
$s_l$ = number of units (not counting the bias unit) in layer l
K = number of output units/classes
Recall that in neural networks, we may have many output nodes. We denote hΘ (x)k as being a
hypothesis that results in the k th output.
Our cost function for neural networks is going to be a generalization of the one we used for
logistic regression.
Recall that the cost function for regularized logistic regression was:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

For neural networks, it is going to be slightly more complicated:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\big((h_\Theta(x^{(i)}))_k\big) + (1-y_k^{(i)})\log\big(1-(h_\Theta(x^{(i)}))_k\big)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{j,i}^{(l)}\big)^2$$
We have added a few nested summations to account for our multiple output nodes. In the first
part of the equation, between the square brackets, we have an additional nested summation
that loops through the number of output nodes.
In the regularization part, after the square brackets, we must account for multiple theta
matrices. The number of columns in our current theta matrix is equal to the number of nodes
in our current layer (including the bias unit). The number of rows in our current theta matrix is
equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic
regression, we square every term.
Note:
the double sum simply adds up the logistic regression costs calculated for each cell in the output
layer; and
the triple sum simply adds up the squares of all the individual Θs in the entire network.
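As an illustrative Octave sketch of this cost for a network with two weight matrices (assuming H is the m x K matrix of hypothesis outputs from forward propagation, Y is the m x K matrix of recoded labels, and Theta1/Theta2 hold the weights; these names are assumptions, not the assignment's exact code):

m = size(Y, 1);   % number of training examples

% Double sum over training examples and output units (logistic cost per output cell)
J = (1/m) * sum(sum(-Y .* log(H) - (1 - Y) .* log(1 - H)));

% Triple sum over every Theta entry, skipping the bias columns (first column of each matrix)
reg = (lambda / (2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));

J = J + reg;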
Backpropagation Algorithm
"Backpropagation" is neural-network terminology for minimizing our cost function, just like
what we were doing with gradient descent in logistic and linear regression.
minΘ J(Θ)
That is, we want to minimize our cost function J using an optimal set of parameters in theta.
In this section we'll look at the equations we use to compute the partial derivative of J(Θ): $\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)$
$\delta_j^{(l)}$ = "error" of node j in layer l

Recall that $a_j^{(l)}$ is activation node j in layer l.
For the last layer, we can compute the vector of delta values with:
δ (L) = a(L) − y
Where L is our total number of layers and a(L) is the vector of outputs of the activation units
for the last layer. So our "error values" for the last layer are simply the differences of our actual
results in the last layer and the correct outputs in y.
To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:

$$\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big)\ \text{.*}\ g'(z^{(l)})$$

The full back propagation equation for the inner nodes is then:

$$\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big)\ \text{.*}\ a^{(l)}\ \text{.*}\ (1 - a^{(l)})$$
A. Ng states that the derivation and proofs are complicated and involved, but you can still
implement the above equations to do back propagation without knowing the details.
We can compute our partial derivative terms by multiplying our activation values and our error values for each training example t (ignoring regularization, which is handled below):

$$\frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = \frac{1}{m}\sum_{t=1}^{m} a_j^{(t)(l)}\, \delta_i^{(t)(l+1)}$$

Note: $\delta^{(l+1)}$ and $a^{(l+1)}$ are vectors with $s_{l+1}$ elements. Similarly, $a^{(l)}$ is a vector with $s_l$ elements. Multiplying them produces a matrix that is $s_{l+1}$ by $s_l$, which is the same dimension as $\Theta^{(l)}$. That is, the process produces a gradient term for every element in $\Theta^{(l)}$. (Actually, $\Theta^{(l)}$ has $s_l + 1$ columns, so the dimensionality is not exactly the same.)
We can now take all these equations and put them together into a backpropagation algorithm:

Set $\Delta_{i,j}^{(l)} := 0$ for all (l, i, j).

For each training example t = 1 to m: perform forward propagation to compute $a^{(l)}$ for l = 2, ..., L, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$, step back to compute $\delta^{(L-1)}, \ldots, \delta^{(2)}$, and accumulate $\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$.

Then:

$D_{i,j}^{(l)} := \frac{1}{m}\left(\Delta_{i,j}^{(l)} + \lambda\Theta_{i,j}^{(l)}\right)$ if j ≠ 0. NOTE: A typo in the lecture slide omits the outer parentheses. This version is correct.

$D_{i,j}^{(l)} := \frac{1}{m}\Delta_{i,j}^{(l)}$ if j = 0.
The capital-delta matrix is used as an "accumulator" to add up our values as we go along and
eventually compute our partial derivative.
The actual proof is quite involved, but the $D_{i,j}^{(l)}$ terms are the partial derivatives and the results we are looking for:

$$D_{i,j}^{(l)} = \frac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}$$
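A rough Octave sketch of the accumulation and averaging for a network with one hidden layer (Theta1, Theta2). The per-example forward-propagation values a1, a2, a3, z2, the label vector yt, and a sigmoidGradient() helper (g'(z) = g(z).*(1-g(z))) are assumed to exist; this is not the exact assignment code:

Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));

for t = 1:m
  % ... forward propagation for example t gives a1, a2, a3, z2 ...
  d3 = a3 - yt;                                    % "error" of the output layer
  d2 = (Theta2' * d3) .* [1; sigmoidGradient(z2)]; % step back one layer
  d2 = d2(2:end);                                  % drop the error on the bias unit
  Delta2 = Delta2 + d3 * a2';                      % accumulate the gradient terms
  Delta1 = Delta1 + d2 * a1';
end

Theta1_grad = Delta1 / m;   % the lambda/m * Theta term for j != 0 is added as described above
Theta2_grad = Delta2 / m;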
Backpropagation Intuition
The cost function is:

$$J(\theta) = -\frac{1}{m}\sum_{t=1}^{m}\sum_{k=1}^{K}\left[y_k^{(t)}\log\big(h_\theta(x^{(t)})_k\big) + (1-y_k^{(t)})\log\big(1-h_\theta(x^{(t)})_k\big)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_l+1}\big(\theta_{j,i}^{(l)}\big)^2$$
Intuitively, $\delta_j^{(l)}$ is the "error" for $a_j^{(l)}$ (unit j in layer l).

More formally, the delta values are actually the derivative of the cost function:

$$\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}\, \text{cost}(t)$$
Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the
slope the more incorrect we are.
Note: In lecture, sometimes i is used to index a training example. Sometimes it is used to index
a unit in a layer. In the Back Propagation Algorithm described here, t is used to index a training
example rather than overloading the use of i.
Implementation Note: Unrolling Parameters
With neural networks, we are working with sets of matrices: $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \ldots$ and $D^{(1)}, D^{(2)}, D^{(3)}, \ldots$
In order to use optimizing functions such as "fminunc()", we will want to "unroll" all the
elements and put them into one long vector:
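For example, the unrolling step itself could look like this (a sketch assuming matrices named Theta1, Theta2, Theta3 and gradient matrices D1, D2, D3):

thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];   % stack all elements into one column vector
deltaVector = [ D1(:); D2(:); D3(:) ];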
If the dimensions of Theta1, Theta2, and Theta3 are 10x11, 10x11 and 1x11 respectively, then we can get back our original matrices from the "unrolled" version as follows:

Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
NOTE: The lecture slides show an example neural network with 3 layers. However, 3 theta
matrices are defined: Theta1, Theta2, Theta3. There should be only 2 theta matrices: Theta1 (10
x 11), Theta2 (1 x 11).
Gradient Checking
Gradient checking will assure that our backpropagation works as intended.
$$\frac{\partial}{\partial\Theta}J(\Theta) \approx \frac{J(\Theta+\epsilon) - J(\Theta-\epsilon)}{2\epsilon}$$

With multiple theta matrices, we can approximate the derivative with respect to $\Theta_j$ as follows:

$$\frac{\partial}{\partial\Theta_j}J(\Theta) \approx \frac{J(\Theta_1,\ldots,\Theta_j+\epsilon,\ldots,\Theta_n) - J(\Theta_1,\ldots,\Theta_j-\epsilon,\ldots,\Theta_n)}{2\epsilon}$$
A small value for ϵ (epsilon) guarantees that the math above works out properly. If the value is far smaller, we may end up with numerical problems. Professor Ng usually uses the value ϵ = 10⁻⁴.

We are only adding or subtracting epsilon to the $\Theta_j$ matrix. In Octave we can do it as follows:
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
end;
Once you've verified once that your backpropagation algorithm is correct, then you don't need
to compute gradApprox again. The code to compute gradApprox is very slow.
Random Initialization
Initializing all theta weights to zero does not work with neural networks. When we
backpropagate, all nodes will update to the same value repeatedly.
Initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon, \epsilon]$. One effective choice is:

$$\epsilon = \frac{\sqrt{6}}{\sqrt{L_{\text{output}} + L_{\text{input}}}}$$
rand(x,y) will initialize a matrix of random real numbers between 0 and 1. (Note: this epsilon is
unrelated to the epsilon from Gradient Checking)
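A short Octave sketch of this initialization for two weight matrices (the dimensions and the epsilon value here are only examples):

epsilon_init = 0.12;   % e.g., a value computed from the formula above
Theta1 = rand(10, 11) * (2 * epsilon_init) - epsilon_init;   % each entry in [-epsilon, epsilon]
Theta2 = rand(1, 11)  * (2 * epsilon_init) - epsilon_init;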
Putting it Together
First, pick a network architecture; choose the layout of your neural network, including how
many hidden units in each layer and how many layers total.
Defaults: 1 hidden layer. If more than 1 hidden layer, then the same number of units in every
hidden layer.
Training a neural network:
1. Randomly initialize the weights.
2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$.
3. Implement the cost function.
4. Implement backpropagation to compute the partial derivatives.
5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
When we perform forward and back propagation, we loop on every training example:
for i = 1:m,
   Perform forward propagation and backpropagation using example (x(i), y(i))
   (Get activations a(l) and delta terms d(l) for l = 2,...,L)
It will also explain how the images are converted through several formats to be processed and displayed.
Introduction
The classifier provided expects 20 x 20 pixel black and white images converted into a row vector of 400 real numbers, like this:

Each pixel is represented by a real number between -1.0 and 1.0, where -1.0 equals black and 1.0 equals white (any number in between is a shade of gray, and 0.0 is exactly the middle gray).
Image3DmatrixRGB = imread("myOwnPhoto.jpg");
A common way to convert color images to black & white is to convert them to the YIQ standard and keep only the Y component, which represents the luma information (black & white). I and Q represent the chrominance information (color). Octave has a function rgb2ntsc() that outputs a similar three-dimensional matrix, but of real numbers from -1.0 to 1.0, representing the height x width x 3 (Y luma, I in-phase, Q quadrature) intensity for each pixel.
Image3DmatrixYIQ = rgb2ntsc(MyImageRGB);
To obtain the Black & White component just discard the I and Q matrices. This leaves a two-
dimensional matrix of real numbers from -1.0 to 1.0 representing the height x width pixels
black & white values.
Image2DmatrixBW = Image3DmatrixYIQ(:,:,1);
It is useful to crop the original image to be as square as possible. The way to crop a matrix is by
selecting an area inside the original B&W image and copy it to a new matrix. This is done by
selecting the rows and columns that define the area. In other words, it is copying a rectangular
subset of the matrix like this:
Cropping does not have to be all the way to a square. It could be cropping just a percentage of the way to a square, so you can leave more of the image intact. The next step of scaling will take care of stretching the image to fit a square.
Scaling to 20 x 20 pixels
The classifier provided was trained with 20 x 20 pixel images, so we need to scale our photos to match. This may cause distortion depending on the height and width ratio of the cropped original photo. There are many ways to scale a photo, but we are going to use the simplest one.
We lay a scaled grid of 20 x 20 over the original photo and take a sample pixel at the center of each grid cell. To lay a scaled grid, we compute two vectors of 20 indexes each, evenly spaced over the original size of the image: one for the height and one for the width. For example, an image of 320 x 200 pixels will produce two such vectors. We then copy the value of each pixel located by this grid of indexes to a new matrix, ending up with a matrix of 20 x 20 real numbers.
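One possible Octave sketch of computing those two index vectors (variable names are illustrative, not the tutorial's exact code):

[origHeight, origWidth] = size(croppedImage);      % e.g., 320 x 200 after cropping
rowIndex = round(linspace(1, origHeight, 20));     % 20 evenly spaced row positions
colIndex = round(linspace(1, origWidth,  20));     % 20 evenly spaced column positions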
The classifier provided was trained with images of white digits over a gray background. Specifically, the 20 x 20 matrix of real numbers ONLY ranges from 0.0 to 1.0 instead of the complete black & white range of -1.0 to 1.0. This means that we have to normalize our photos to the range 0.0 to 1.0 for this classifier to work. We also invert the black and white colors, because it is easier to "draw" black over white on our photos and we need to get white digits. So in short, we invert black and white and stretch black to gray.
Rotation of image
Sometimes our photos are automatically rotated, as on our cellular phones. The classifier provided cannot recognize rotated images, so we may need to rotate them back. This can be done with the Octave function rot90(), like this.

Here rotationStep is an integer: -1 means rotate 90 degrees CCW and 1 means rotate 90 degrees CW.
Approach
1. The approach is to have a function that converts our photo to the format the classifier is
expecting. As if it was just a sample from the training data set.
Read the file as a RGB image and convert it to Black & White 2D matrix (see the introduction).
Obtain the origin and amount of the columns and rows to be copied to the cropped image.
Compute the scale and compute back the new size. This last step is extra. It is computed back so the code stays general for future modification of the classifier size. For example, if it changed from 20 x 20 pixels to 30 x 30, then we only need to change the line of code where the scale is computed.
Copy just the indexed values from old image to get new image of 20 x 20 real numbers. This is
called "sampling" because it copies just a sample pixel indexed by a grid. All the sample pixels
make the new image.
% Copy just the indexed values from old image to get new image
newImage = croppedImage(rowIndex,colIndex,:);
Rotate the matrix using the rot90() function with the rotStep parameter: -1 is CCW, 0 is no
rotate, 1 is CW.
Invert black and white because it is easier to draw black digits over white background in our
photos but the classifier needs white digits.
Find the min and max gray values in the image and compute the total value range in
preparation for normalization.
Do normalization so all values end up between 0.0 and 1.0, because this particular classifier does not perform well with negative numbers.
Add some contrast to the image. The multiplication factor is the contrast control; you can increase it if desired to obtain sharper contrast (contrast only between gray and white; black was already removed in normalization).
% Add contrast. Multiplication factor is contrast control.
contrastedImage = sigmoid((normImage - 0.5) * 5);
Show the image, specifying the black & white range [-1 1] to avoid automatic ranging based on the image's own gray-to-white range of values. Showing the photo with a different range does not affect the values in the output matrix, so it does not affect the classifier. It is only visual feedback for the user.
Finally, output the matrix as an unrolled vector to be compatible with the classifier.
End function.
end;
Usage samples
Single photo
No cropping
Multiple photos
First crop to square and second 25% of the way to square photo
First no rotation and second CW rotation:

vectorImage(1,:) = imageTo20x20Gray('myFirstDigit.jpg',100);
vectorImage(2,:) = imageTo20x20Gray('mySecondDigit.jpg',25,1);
predict(Theta1, Theta2, vectorImage)
Tips
JPG photos of black numbers over white background
Rotate as needed because the classifier can only work with vertical digits
Leave background space around the digit. At least 2 pixels when seen at 20 x 20 resolution. This means that the classifier only really works in a 16 x 16 area.
Photo Gallery
Digit 2
Digit 6
Digit 6 inverted is digit 9. This is the same photo of a six but rotated. Also, changed
the contrast multiplier from 5 to 20. You can note that the gray background is
smoother.
Digit 3
And the equation to compute partial derivatives of the theta terms in the [last] hidden layer neurons (layer L-1):

$$\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}$$
Clearly they share some pieces in common, so a delta term ($\delta^{(L)}$) can be used for the common pieces between the output layer and the hidden layer immediately before it (with the possibility that there could be many hidden layers if we wanted):

$$\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}$$
And we can go ahead and use another delta term ($\delta^{(L-1)}$) for the pieces that would be shared by the final hidden layer and a hidden layer before that, if we had one. Regardless, this delta term will still serve to make the math and implementation more concise.

$$\delta^{(L-1)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}$$

With these delta terms, our equations become:

$$\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}} \qquad \frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}$$

Using $\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}$, we need to evaluate both partial derivatives.
Given $J(\theta) = -y\log(a^{(L)}) - (1-y)\log(1-a^{(L)})$, where $a^{(L)} = h_\theta(x)$, the partial derivative is:

$$\frac{\partial J(\theta)}{\partial a^{(L)}} = \frac{1-y}{1-a^{(L)}} - \frac{y}{a^{(L)}}$$

And given $a = g(z)$, where $g(z) = \frac{1}{1+e^{-z}}$, the partial derivative is:

$$\frac{\partial a^{(L)}}{\partial z^{(L)}} = a^{(L)}(1 - a^{(L)})$$

So, let's substitute these in for $\delta^{(L)}$:

$$\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}} = \left(\frac{1-y}{1-a^{(L)}} - \frac{y}{a^{(L)}}\right)\big(a^{(L)}(1-a^{(L)})\big) = a^{(L)} - y$$

$$\delta^{(3)} = a^{(3)} - y$$

Now, given $z = \theta \cdot \text{input}$, and in layer L the input is $a^{(L-1)}$, the partial derivative is:

$$\frac{\partial z^{(L)}}{\partial \theta^{(L-1)}} = a^{(L-1)}$$

Putting it together:

$$\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \delta^{(L)}\frac{\partial z^{(L)}}{\partial \theta^{(L-1)}} = (a^{(L)} - y)(a^{(L-1)})$$
Let's continue on for the hidden layer (let's assume we only have 1 hidden layer):

$$\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)}\frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}$$

We need to evaluate the pieces of $\delta^{(L-1)}$:

$$\frac{\partial z^{(L)}}{\partial a^{(L-1)}} = \theta^{(L-1)} \qquad \text{and} \qquad \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} = a^{(L-1)}(1 - a^{(L-1)})$$

So, let's substitute these in for $\delta^{(L-1)}$:

$$\delta^{(L-1)} = \delta^{(L)}\frac{\partial z^{(L)}}{\partial a^{(L-1)}}\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}$$

$$\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)}\frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}} = \left(\delta^{(L)}\frac{\partial z^{(L)}}{\partial a^{(L-1)}}\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\right)\big(a^{(L-2)}\big)$$

$$\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \big((a^{(L)} - y)(\theta^{(L-1)})(a^{(L-1)}(1-a^{(L-1)}))\big)\big(a^{(L-2)}\big)$$
The following notes describe how to modify the ex4-style neural network to produce a linear output (for regression) instead of a classification output. There is only one output node, so you do not need the 'num_labels' parameter.
Since there is one linear output, you do not need to convert y into a logical matrix.
The non-linear function is often the tanh() function; it has an output range from -1 to +1, and its gradient is easily implemented. Let g(z) = tanh(z). The gradient of tanh is $g'(z) = 1 - g(z)^2$. Use this in backpropagation in place of the sigmoid gradient.
Remove the sigmoid function from the output layer (i.e. calculate a3 without using a sigmoid
function), since we want a linear output.
Cost computation: Use the linear cost function for J (from ex1 and ex5) for the unregularized
portion. For the regularized portion, use the same method as ex4.
Where reshape() is used to form the Theta matrices, replace 'num_labels' with '1'.
You still need to randomly initialize the Theta values, just as with any NN. You will want to
experiment with different epsilon values. You will also need to create a predictLinear() function,
using the tanh() function in the hidden layer, and a linear output.
% inputs
nn_params = [31 16 15 -29 -13 -8 -7 13 54 -17 -11 -9 16]' / 10;
il = 1;
hl = 4;
X = [1; 2; 3];
y = [1; 4; 9];
lambda = 0.01;

% command
[j g] = nnCostFunctionLinear(nn_params, il, hl, X, y, lambda)

% results
j = 0.020815
g =
   -0.0131002
   -0.0110085
   -0.0070569
    0.0189212
   -0.0189639
   -0.0192539
   -0.0102291
    0.0344732
    0.0024947
    0.0080624
    0.0021964
    0.0031675
   -0.0064244
Now create a script that uses the 'ex5data1.mat' from ex5, but without creating the polynomial
terms. With 8 units in the hidden layer and MaxIter set to 200, you should be able to get a final
cost value of 0.3 to 0.4. The results will vary a bit due to the random Theta initialization. If you
plot the training set and the predicted values for the training set (using your predictLinear()
function), you should have a good match.
ML:Advice for Applying Machine Learning

Errors in your predictions can be troubleshooted by:
- Getting more training examples
- Trying smaller sets of features
- Trying additional features
- Trying polynomial features
- Increasing or decreasing λ

Don't just pick one of these avenues at random. We'll explore diagnostic techniques for choosing one of the above solutions in the following sections.
Evaluating a Hypothesis
A hypothesis may have low error for the training examples but still be
inaccurate (because of overfitting).
With a given dataset of training examples, we can split up the data into two
sets: a training set and a test set.
The test set error: 1. For linear regression: $J_{test}(\Theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \big(h_\Theta(x_{test}^{(i)}) - y_{test}^{(i)}\big)^2$
2. For classification ~ Misclassification error (aka 0/1 misclassification error):
This gives us the proportion of the test data that was misclassified.
The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any other data set.
In order to choose the model of your hypothesis, you can test each degree
of polynomial and look at the error result.
Without the Validation Set (note: this is a bad method - do not use it)
1. Optimize the parameters in Θ using the training set for each polynomial
degree.
2. Find the polynomial degree d with the least error using the test set.
3. Estimate the generalization error also using the test set with Jtest (Θ(d) ), (d
= theta from polynomial with lower error);
To solve this, we can introduce a third set, the Cross Validation Set, to
serve as an intermediate set that we can train d with. Then our test set will
give us an accurate, non-optimistic error.
One example way to break down our dataset into the three sets is:
We can now calculate three separate error values for the three different
sets.
With the Validation Set (note: this method presumes we do not also
use the CV set for regularization)
1. Optimize the parameters in Θ using the training set for each polynomial
degree.
2. Find the polynomial degree d with the least error using the cross validation
set.
3. Estimate the generalization error using the test set with Jtest (Θ(d) ), (d =
theta from polynomial with lower error);
This way, the degree of the polynomial d has not been trained using the
test set.
(Mentor note: be aware that using the CV set to select 'd' means that we
cannot also use it for the validation curve process of setting the lambda
value).
The training error will tend to decrease as we increase the degree d of the
polynomial.
At the same time, the cross validation error will tend to decrease as we
increase d up to a point, and then it will increase as d is increased, forming
a convex curve.
High bias (underfitting): both Jtrain (Θ) and JCV (Θ) will be high. Also,
JCV (Θ) ≈ Jtrain (Θ).
High variance (overfitting): Jtrain (Θ) will be low and JCV (Θ) will be
much greater thanJtrain (Θ).
The relationship of λ to the training set and the variance set is as follows:
Low λ: Jtrain (Θ) is low and JCV (Θ) is high (high variance/overfitting).
Intermediate λ: Jtrain (Θ) and JCV (Θ) are somewhat low and
Jtrain (Θ) ≈ JCV (Θ).
Large λ: both Jtrain (Θ) and JCV (Θ) will be high (underfitting /high bias)
The figure below illustrates the relationship between lambda and the
hypothesis:
In order to choose the model and the regularization term λ, we need to:
1. Create a list of lambda values to try (e.g., a doubling sequence such as 0, 0.01, 0.02, 0.04, ..., 10.24).
2. Create a set of models with different polynomial degrees or other variants.
3. Iterate through the λs and for each λ go through all the models to learn some Θ.
4. Compute the cross validation error using the learned Θ (computed with
λ) on the JCV (Θ) without regularization or λ = 0.
5. Select the best combo that produces the lowest error on the cross
validation set.
6. Using the best combo Θ and λ, apply it on Jtest (Θ) to see if it has a good
generalization of the problem.
Learning Curves
Training 3 examples will easily have 0 errors because we can always find a
quadratic curve that exactly touches 3 points.
As the training set gets larger, the error for a quadratic function increases.
The error value will plateau out after a certain m, or training set size.
Experiencing high bias:
Low training set size: causes $J_{train}(\Theta)$ to be low and $J_{CV}(\Theta)$ to be high.
Large training set size: causes both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ to be high with $J_{train}(\Theta) \approx J_{CV}(\Theta)$.

Experiencing high variance:
Low training set size: $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be high.
Large training set size: $J_{train}(\Theta)$ increases with training set size and $J_{CV}(\Theta)$ continues to decrease without leveling off. Also, $J_{train}(\Theta) < J_{CV}(\Theta)$ but the difference between them remains significant.
Deciding what to do next (revisited): our decision process can be broken down as follows:
- Getting more training examples: fixes high variance
- Trying smaller sets of features: fixes high variance
- Adding features: fixes high bias
- Adding polynomial features: fixes high bias
- Decreasing λ: fixes high bias
- Increasing λ: fixes high variance
Using a single hidden layer is a good starting default. You can train your
neural network on a number of hidden layers using your cross validation
set.
Model Selection:
Choosing M the order of polynomials.
Choose the model which best fits the data without overfitting (very difficult).
With high bias (underfitting), $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will both be high and $J_{train}(\Theta) \approx J_{CV}(\Theta)$.
Simple model => more rigid => does not change as much with changes in X
=> low variance, high bias.
One of the most important goals in learning: finding a model that is just
right in the bias-variance trade-off.
Regularization Effects:
Large values of λ pull weight parameters to zero leading to large bias =>
underfitting.
Lower-order polynomials (low model complexity) have high bias and low
variance. In this case, the model fits poorly consistently.
More training examples fixes high variance but not high bias.
The addition of polynomial and interaction features fixes high bias but not
high variance.
When using gradient descent, decreasing lambda can fix high bias and
increasing lambda can fix high variance (lambda is the regularization
parameter).
When using neural networks, small neural networks are more prone to under-fitting and big neural networks are prone to over-fitting. Cross-validation of network size is a way to choose alternatives.
One option for improving a system such as a spam classifier is to collect lots of data (for example, the "honeypot" project, though it doesn't always work).
Error Analysis
The recommended approach to solving machine learning problems is:
- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Manually examine the errors on examples in the cross validation set and try to spot a trend in where most of the errors were made.
You may need to process your input before it is useful. For example, if your input is a set of words, you may want to treat the same word with different forms (fail/failing/failed) as one word, so you would use "stemming software" to recognize them all as one.
Error Metrics for Skewed Classes
It is sometimes difficult to tell whether a reduction in error is actually an
improvement of the algorithm.
This usually happens with skewed classes; that is, when our class is very
rare in the entire data set.
Or to say it another way, when we have lot more examples from one class
than from the other class.
Precision: of all patients we predicted where y=1, what fraction actually has
cancer?
Recall: Of all the patients that actually have cancer, what fraction did we
correctly detect as having cancer?
$$\text{Precision} = \frac{\text{True Positives}}{\text{Total number of predicted positives}} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

$$\text{Recall} = \frac{\text{True Positives}}{\text{Total number of actual positives}} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
These two metrics give us a better sense of how our classifier is doing. We
want both precision and recall to be high.
$$\text{Accuracy} = \frac{\text{true positives} + \text{true negatives}}{\text{total population}}$$
For example, we could raise the prediction threshold and predict y = 1 (cancer) only if $h_\theta(x) \geq 0.7$. This way, we only predict cancer if we think the patient has at least a 70% chance of having it.
Doing this, we will have higher precision but lower recall (refer to the
definitions in the previous section).
That way, we get a very safe prediction. This will cause higher recall but
lower precision.
The greater the threshold, the greater the precision and the lower the
recall.
The lower the threshold, the greater the recall and the lower the precision.
In order to turn these two metrics into one single number, we can take the F value.

One way is to take the average: $\frac{P+R}{2}$

This does not work well. If we predict all y=0, then that will bring the average up despite having 0 recall. If we predict all examples as y=1, then the very high recall will bring up the average despite having 0 precision.

$$F\ \text{Score} = 2\frac{PR}{P+R}$$
In order for the F Score to be large, both precision and recall must be large.
We want to train precision and recall on the cross validation set so as not
to bias our test set.
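A small Octave sketch computing these metrics from 0/1 predictions and actual labels on the cross validation set (variable names are illustrative):

tp = sum((predictions == 1) & (yval == 1));   % true positives
fp = sum((predictions == 1) & (yval == 0));   % false positives
fn = sum((predictions == 0) & (yval == 1));   % false negatives

precision = tp / (tp + fp);
recall    = tp / (tp + fn);
F1        = 2 * precision * recall / (precision + recall);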
We must choose our features to have enough information. A useful test is:
Given input x, would a human expert be able to confidently predict y?
Rationale for large data: if we have a low bias algorithm (many features
or hidden units making a very complex function), then the larger the
training set we use, the less we will have overfitting (and the more accurate
the algorithm will be on the test set).
Quiz instructions
When the quiz instructions tell you to enter a value to "two decimal digits",
what it really means is "two significant digits". So, just for example, the
value 0.0123 should be entered as "0.012", not "0.01".
Optimization Objective
The Support Vector Machine (SVM) is yet another type of supervised machine
learning algorithm. It is sometimes cleaner and more powerful.
Recall the cost function for (unregularized) logistic regression:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} -y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log(1-h_\theta(x^{(i)}))$$

$$= \frac{1}{m}\sum_{i=1}^{m} -y^{(i)}\log\left(\frac{1}{1+e^{-\theta^T x^{(i)}}}\right) - (1-y^{(i)})\log\left(1-\frac{1}{1+e^{-\theta^T x^{(i)}}}\right)$$
To make a support vector machine, we will modify the first term of the cost function, $-\log(h_\theta(x)) = -\log\left(\frac{1}{1+e^{-\theta^T x}}\right)$, so that when $\theta^T x$ (from now on, we shall refer to this as z) is greater than 1, it outputs 0. Furthermore, for values of z less than 1, we shall use a straight decreasing line instead of the sigmoid curve. (In the literature, this is called a hinge loss function: https://en.wikipedia.org/wiki/Hinge_loss.)
Similarly, we modify the second term of the cost function, $-\log(1-h_\theta(x)) = -\log\left(1-\frac{1}{1+e^{-\theta^T x}}\right)$, so that when z is less than -1, it outputs 0. We also modify it so that for values of z greater than -1, we use a straight increasing line instead of the sigmoid curve.
We shall denote these as cost1 (z) and cost0 (z) (respectively, note that cost1 (z) is
the cost for classifying when y=1, and cost0 (z) is the cost for classifying when y=0),
and we may define them as follows (where k is an arbitrary constant defining the
magnitude of the slope of the line):
z = θT x
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \left[y^{(i)}\big(-\log(h_\theta(x^{(i)}))\big) + (1-y^{(i)})\big(-\log(1-h_\theta(x^{(i)}))\big)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\Theta_j^2$$
Note that the negative sign has been distributed into the sum in the above equation.
We may transform this into the cost function for support vector machines by
substituting cost0 (z) and cost1 (z):
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \left[y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\Theta_j^2$$
We can optimize this a bit by multiplying this by m (thus removing the m factor in the
denominators). Note that this does not affect our optimization, since we're simply
multiplying our cost function by a positive constant (for example, minimizing (u −
5)2 + 1 gives us 5; multiplying it by 10 to make it 10(u − 5)2 + 10 still gives us 5
when minimized).
$$J(\theta) = \sum_{i=1}^{m} \left[y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right] + \frac{\lambda}{2}\sum_{j=1}^{n}\Theta_j^2$$

Furthermore, convention dictates that we regularize using a factor C instead of λ, like so:

$$J(\theta) = C\sum_{i=1}^{m} \left[y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\Theta_j^2$$
This is equivalent to multiplying the equation by $C = \frac{1}{\lambda}$, and thus results in the same values when optimized. Now, when we wish to regularize more (that is, reduce overfitting), we decrease C, and when we wish to regularize less (that is, reduce underfitting), we increase C.
Finally, note that the hypothesis of the Support Vector Machine is not interpreted as
the probability of y being 1 or 0 (as it is for the hypothesis of logistic regression).
Instead, it outputs either 1 or 0. (In technical terms, it is a discriminant function.)
$$h_\theta(x) = \begin{cases} 1 & \text{if } \Theta^T x \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
Large Margin Intuition
A useful way to think about Support Vector Machines is to think of them as Large
Margin Classifiers.
Now when we set our constant C to a very large value (e.g. 100,000), our optimizing
function will constrain Θ such that the equation A (the summation of the cost of each
example) equals 0. We impose the following constraints on Θ:
$$\sum_{i=1}^{m} y^{(i)}\,\text{cost}_1(\Theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\Theta^T x^{(i)}) = 0$$

This reduces our cost function to:

$$J(\theta) = C \cdot 0 + \frac{1}{2}\sum_{j=1}^{n}\Theta_j^2 = \frac{1}{2}\sum_{j=1}^{n}\Theta_j^2$$
Recall the decision boundary from logistic regression (the line separating the positive
and negative examples). In SVMs, the decision boundary has the special property that
it is as far away as possible from both the positive and the negative examples.
The distance of the decision boundary to the nearest example is called the margin.
Since SVMs maximize this margin, it is often called a Large Margin Classifier.
The SVM will separate the negative and positive examples by a large margin.
Data is linearly separable when a straight line can separate the positive and
negative examples.
If we have outlier examples that we don't want to affect the decision boundary, then
we can reduce C.
Increasing and decreasing C is similar to respectively decreasing and increasing λ, and
can simplify our decision boundary.
Vector Inner Product: say we have two vectors

$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \qquad v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$$

The length of vector v is denoted ∣∣v∣∣, and it describes the line on a graph from the origin (0,0) to $(v_1, v_2)$. The length of vector v can be calculated with $\sqrt{v_1^2 + v_2^2}$ by the Pythagorean theorem.
The projection of vector v onto vector u is found by taking a right angle from u to the
end of v, creating a right triangle.
uT v = p ⋅ ∣∣u∣∣
Note that uT v = ∣∣u∣∣ ⋅ ∣∣v∣∣ cos θ where θ is the angle between u and v. Also, p =
∣∣v∣∣ cos θ . If you substitute p for ∣∣v∣∣ cos θ , you get uT v = p ⋅ ∣∣u∣∣.
So the product uT v is equal to the length of the projection times the length of vector
u.
uT v = v T u = p ⋅ ∣∣u∣∣ = u1 v1 + u2 v2
If the angle between the lines for v and u is greater than 90 degrees, then the
projection p will be negative.
Recall the SVM objective (with C very large):

$$\min_\Theta \frac{1}{2}\sum_{j=1}^{n}\Theta_j^2 = \frac{1}{2}\big(\Theta_1^2 + \Theta_2^2 + \cdots + \Theta_n^2\big) = \frac{1}{2}\left(\sqrt{\Theta_1^2 + \Theta_2^2 + \cdots + \Theta_n^2}\right)^2 = \frac{1}{2}\|\Theta\|^2$$
The reason this causes a "large margin" is because: the vector for Θ is perpendicular
to the decision boundary. In order for our optimization objective (above) to hold true,
we need the absolute value of our projections p(i) to be as large as possible.
Kernels I
Kernels allow us to make complex, non-linear classifiers using Support Vector
Machines.
Given x, compute a new feature depending on proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$. To do this, we find the "similarity" of x and some landmark $l^{(i)}$ (this similarity function is called a Gaussian Kernel):

$$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$$

If $x \approx l^{(i)}$, then $f_i = \exp\left(-\frac{\approx 0^2}{2\sigma^2}\right) \approx 1$

If x is far from $l^{(i)}$, then $f_i = \exp\left(-\frac{(\text{large number})^2}{2\sigma^2}\right) \approx 0$
In other words, if x and the landmark are close, then the similarity will be close to 1,
and if x and the landmark are far away from each other, the similarity will be close to
0.
Each landmark gives us a feature in our hypothesis:

$$l^{(1)} \rightarrow f_1, \quad l^{(2)} \rightarrow f_2, \quad l^{(3)} \rightarrow f_3$$

$$h_\Theta(x) = \Theta_1 f_1 + \Theta_2 f_2 + \Theta_3 f_3 + \ldots$$
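A minimal Octave sketch of the Gaussian similarity function above (the function name is illustrative):

function f = gaussianKernel(x, l, sigma)
  % Similarity between example x and landmark l, both column vectors, with bandwidth sigma
  f = exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2));
end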
Kernels II
One way to get the landmarks is to put them in the exact same locations as all the
training examples. This gives us m landmarks, with one landmark per training
example.
Given example x:

$$f_1 = \text{similarity}(x, l^{(1)}), \quad f_2 = \text{similarity}(x, l^{(2)}), \quad f_3 = \text{similarity}(x, l^{(3)}), \ldots$$

This gives us a "feature vector," $f^{(i)}$, of all our features for example $x^{(i)}$. We may also set $f_0 = 1$ to correspond with $\Theta_0$. Thus given training example $x^{(i)}$:

$$x^{(i)} \rightarrow \left[f_1^{(i)} = \text{similarity}(x^{(i)}, l^{(1)}),\; f_2^{(i)} = \text{similarity}(x^{(i)}, l^{(2)}),\; \ldots,\; f_m^{(i)} = \text{similarity}(x^{(i)}, l^{(m)})\right]$$
Now to get the parameters Θ we can use the SVM minimization algorithm but with
f (i) substituted in for x(i) :
Using kernels to generate f(i) is not exclusive to SVMs and may also be applied to
logistic regression. However, because of computational optimizations on SVMs,
kernels combined with SVMs is much faster than with other algorithms, so kernels are
almost always found combined only with SVMs.
The other parameter we must choose is σ 2 from the Gaussian Kernel function:
With a large σ 2 , the features fi vary more smoothly, causing higher bias and lower
variance.
With a small σ 2 , the features fi vary less smoothly, causing lower bias and higher
variance.
Using An SVM
There are lots of good SVM libraries already written. A. Ng often uses 'liblinear' and
'libsvm'. In practical application, you should use one of these libraries rather than
rewrite the functions.
Choice of parameter C
Note: not all similarity functions are valid kernels. They must satisfy "Mercer's
Theorem" which guarantees that the SVM package's optimizations run correctly and
do not diverge.
You want to train C and the parameters for the kernel function using the training and
cross-validation datasets.
Multi-class Classification
You can use the one-vs-all method just like we did for logistic regression, where y ∈ {1, 2, 3, …, K} with $\Theta^{(1)}, \Theta^{(2)}, \ldots, \Theta^{(K)}$. We pick class i with the largest $(\Theta^{(i)})^T x$.
Choosing between logistic regression and an SVM:
- If n is large (relative to m), then use logistic regression, or SVM without a kernel (the "linear kernel").
- If n is small and m is intermediate, then use SVM with a Gaussian Kernel.
- If n is small and m is large, then manually create/add more features, then use logistic regression or SVM without a kernel.
In the first case, we don't have enough examples to need a complicated polynomial
hypothesis. In the second example, we have enough examples that we may need a
complex non-linear hypothesis. In the last case, we want to increase our features so
that logistic regression becomes applicable.
Note: a neural network is likely to work well for any of these situations, but may be
slower to train.
ML:Clustering
Unsupervised learning is contrasted from supervised learning because it uses an unlabeled training set rather than a labeled one. In other words, we don't have the vector y of expected results; we only have a dataset of features where we can find structure. Clustering is good for things like market segmentation.
K-Means Algorithm
The K-Means Algorithm is the most popular and widely used algorithm for automatically
grouping data into coherent subsets.
1. Randomly initialize two points in the dataset called the cluster centroids.
2. Cluster assignment: assign all examples into one of two groups based on which cluster centroid
the example is closest to.
3. Move centroid: compute the averages for all the points inside each of the two cluster centroid
groups, then move the cluster centroid points to those averages.
Our main variables are:
- K (number of clusters)
- Training set $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$, where $x^{(i)} \in \mathbb{R}^n$
The algorithm:

Randomly initialize K cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$
Repeat:
   for i = 1 to m: $c^{(i)}$ := index (from 1 to K) of the cluster centroid closest to $x^{(i)}$
   for k = 1 to K: $\mu_k$ := average (mean) of the points assigned to cluster k
The first for-loop is the 'Cluster Assignment' step. We make a vector c where c(i) represents the
centroid assigned to example x(i).
We can write the operation of the Cluster Assignment step more mathematically as follows:
That is, each c(i) contains the index of the centroid that has minimal distance to x(i) .
By convention, we square the right-hand side, which makes the function we are trying to minimize increase more sharply. It is mostly just a convention, but one that also helps reduce the computational load: the Euclidean distance would require a square root, and squaring cancels it. So the square convention serves two purposes: the minimum is sharper and there is less computation.
The second for-loop is the 'Move Centroid' step where we move each centroid to the average
of its group.
$$\mu_k = \frac{1}{n}\left[x^{(k_1)} + x^{(k_2)} + \cdots + x^{(k_n)}\right] \in \mathbb{R}^n$$

Where each of $x^{(k_1)}, x^{(k_2)}, \ldots, x^{(k_n)}$ are the training examples assigned to group $\mu_k$.
If you have a cluster centroid with 0 points assigned to it, you can randomly re-initialize that
centroid to a new point. You can also simply eliminate that cluster group.
After a number of iterations the algorithm will converge, where new iterations do not affect the
clusters.
Note on non-separated clusters: some datasets have no real inner separation or natural
structure. K-means can still evenly segment your data into K subsets, so can still be useful in
this case.
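A compact Octave sketch of one K-means iteration (cluster assignment followed by moving the centroids), assuming X is an m x n data matrix and centroids is a K x n matrix; empty clusters would need the handling described above:

m = size(X, 1);           % number of examples
K = size(centroids, 1);   % number of clusters

% Cluster assignment step: c(i) = index of the centroid closest to x(i)
for i = 1:m
  dists = sum((centroids - X(i,:)) .^ 2, 2);   % squared distance to every centroid
  [~, c(i)] = min(dists);
end

% Move centroid step: each centroid becomes the mean of the points assigned to it
for k = 1:K
  centroids(k,:) = mean(X(c == k, :), 1);
end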
Optimization Objective
Recall some of the parameters we used in our algorithm:
$c^{(i)}$ = index of the cluster (1, 2, ..., K) to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid k ($\mu_k \in \mathbb{R}^n$)
$\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned

Using these variables we can define our cost function:

$$J(c^{(1)},\ldots,c^{(m)},\mu_1,\ldots,\mu_K) = \frac{1}{m}\sum_{i=1}^{m} \|x^{(i)} - \mu_{c^{(i)}}\|^2$$
Our optimization objective is to minimize all our parameters using the above cost function:
minc,μ J(c, μ)
That is, we are finding all the values in sets c, representing all our clusters, and μ, representing
all our centroids, that will minimize the average of the distances of every training example to
its corresponding cluster centroid.
The above cost function is often called the distortion of the training examples.
With k-means, it is not possible for the cost function to sometimes increase. It should
always descend.
Random Initialization
There's one particular recommended method for randomly initializing your cluster centroids.
1. Have K<m. That is, make sure the number of your clusters is less than the number of your
training examples.
2. Randomly pick K training examples. (Not mentioned in the lecture, but also be sure the selected examples are unique.)
3. Set $\mu_1, \ldots, \mu_K$ equal to these K examples.
K-means can get stuck in local optima. To decrease the chance of this happening, you can
run the algorithm on many different random initializations. In cases where K<10 it is strongly
recommended to run a loop of random initializations.
for i = 1 to 100:
   randomly initialize k-means
   run k-means to get 'c' and 'm'
   compute the cost function (distortion) J(c,m)
pick the clustering that gave us the lowest cost
The elbow method: plot the cost J and the number of clusters K. The cost function should
reduce as we increase the number of clusters, and then flatten out. Choose K at the point
where the cost function starts to flatten out.
However, fairly often, the curve is very gradual, so there's no clear elbow.
Note: J will always decrease as K is increased. The one exception is if k-means gets stuck at a
bad local optimum.
Another way to choose K is to observe how well k-means performs on a downstream purpose.
In other words, you choose K that proves to be most useful for some goal you're trying to
achieve from using these clusters.
ML:Dimensionality Reduction
Motivation I: Data Compression
We may want to reduce the dimension of our features if we have a lot of redundant data.
To do this, we find two highly correlated features, plot them, and make a new line that seems to
describe both features accurately. We place all the new features on this single line.
Doing dimensionality reduction will reduce the total data we have to store in computer
memory and will speed up our learning algorithm.
Note: in dimensionality reduction, we are reducing our features rather than our number of
examples. Our variable m will stay the same size; n, the number of features each example from
x(1) to x(m) carries, will be reduced.
It is not easy to visualize data that is more than three dimensions. We can reduce the
dimensions of our data to 3 or less in order to plot it.
We need to find new features, z1 , z2 (and perhaps z3 ) that can effectively summarize all the
other features.
Example: hundreds of features related to a country's economic system may all be combined
into one feature that you call "Economic Activity."
Problem formulation
Given two features, x1 and x2 , we want to find a single line that effectively describes both
features at once. We then map our old features onto this new line to get a new single feature.
The same can be done with three features, where we map them to a plane.
The goal of PCA is to reduce the average of all the distances of every feature to the projection
line. This is the projection error.
Reduce from 2d to 1d: find a direction (a vector u(1) ∈ Rn ) onto which to project the data so as
to minimize the projection error.
Reduce from n-dimension to k-dimension: Find k vectors u(1) , u(2) , … , u(k) onto which to
project the data so as to minimize the projection error.
If we are converting from 3d to 2d, we will project our data onto two directions (a plane), so k
will be 2.
PCA is not linear regression
In linear regression, we are minimizing the squared error from every point to our predictor line.
These are vertical distances.
In PCA, we are minimizing the shortest (orthogonal) distances from our data points to the projection line or surface.
More generally, in linear regression we are taking all our examples in x and applying the parameters in Θ to predict y. In PCA, there is no y to predict; we treat the features themselves as the data and look for a lower-dimensional surface onto which to project them.
Data preprocessing
μ_j = (1/m) ∑_{i=1}^m x_j^(i)
Replace each x_j^(i) with x_j^(i) − μ_j
If different features are on different scales (e.g., x_1 = size of house, x_2 = number of bedrooms), scale the features to have a comparable range of values.
Above, we first subtract the mean of each feature from the original feature. Then we scale all the features:
x_j^(i) := (x_j^(i) − μ_j) / s_j
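A minimal Octave sketch of this preprocessing, assuming X is an m×n matrix of examples stored row-wise:
mu = mean(X);              % 1 x n vector of feature means
X_norm = X - mu;           % mean normalization: subtract each feature's mean
s = std(X_norm);           % per-feature scale s_j (the standard deviation; the range also works)
X_norm = X_norm ./ s;      % feature scaling: bring features to comparable ranges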
The z values are all real numbers and are the projections of our features onto u(1) .
So, PCA has two tasks: figure out u(1) , … , u(k) and also to find z1 , z2 , … , zm .
The mathematical proof for the following procedure is complicated and beyond the scope of
this course.
1
1 223
Σ= ∑m (x(i) )(x(i) )T
m i=1
We denote the covariance matrix with a capital sigma (which happens to be the same symbol
for summation, confusingly---they represent entirely different things).
Note that x^(i) is an n×1 vector, (x^(i))^T is a 1×n vector, and X is an m×n matrix (row-wise stored examples). The product (x^(i))(x^(i))^T is an n×n matrix, which matches the dimensions of Σ.
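Because the examples are stored row-wise in X, the sum of outer products can be vectorized; a one-line Octave sketch:
m = size(X, 1);
Sigma = (1 / m) * X' * X;  % n x n covariance matrix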
[U,S,V] = svd(Sigma);
What we actually want out of svd() is the 'U' matrix of the Sigma covariance matrix: U ∈ Rn×n .
U contains u(1) , … , u(n) , which is exactly what we want.
We'll assign the first k columns of U to a variable called 'Ureduce'. This will be an n×k matrix. We compute z with:
z^(i) = Ureduce^T ⋅ x^(i)
Ureduce^T has dimensions k×n while x^(i) has dimensions n×1. The product Ureduce^T ⋅ x^(i) will therefore have dimensions k×1.
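A short Octave sketch of this projection step (x is a single n×1 example; X is the full m×n data matrix):
Ureduce = U(:, 1:k);   % n x k: the first k columns of U
z = Ureduce' * x;      % k x 1 projection of one example
Z = X * Ureduce;       % m x k projection of the whole (row-wise) dataset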
To go back from a compressed representation z^(1) to an approximation of the original data, we can use the equation: x_approx^(1) = Ureduce ⋅ z^(1).
Note: It turns out that the U matrix has the special property that it is a unitary matrix. One of the special properties of a unitary matrix is that U^(−1) = U* (the conjugate transpose). Since we are dealing with real numbers here, this is equivalent to U^(−1) = U^T. So we could compute the inverse and use that, but it would be a waste of energy and compute cycles.
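A matching Octave sketch of the reconstruction step, using the same Ureduce as above:
x_approx = Ureduce * z;   % n x 1 approximation of a single example
X_approx = Z * Ureduce';  % m x n approximation of the whole dataset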
Choosing the Number of Principal Components
Given the average squared projection error: (1/m) ∑_{i=1}^m ∣∣x^(i) − x_approx^(i)∣∣²
Also given the total variation in the data: (1/m) ∑_{i=1}^m ∣∣x^(i)∣∣²
Choose k to be the smallest value such that:
[(1/m) ∑_{i=1}^m ∣∣x^(i) − x_approx^(i)∣∣²] / [(1/m) ∑_{i=1}^m ∣∣x^(i)∣∣²] ≤ 0.01
In other words, the squared projection error divided by the total variation should be no more than one percent, so that 99% of the variance is retained.
The iterative procedure for choosing k is:
1. Try PCA with k = 1.
2. Compute Ureduce, z, x_approx.
3. Check with the formula given above whether 99% of the variance is retained. If not, go back to step one and increase k.
This procedure would actually be horribly inefficient. In Octave, we instead call svd once:
[U,S,V] = svd(Sigma)
This gives us a diagonal matrix S. We can check whether 99% of the variance is retained using the S matrix as follows:
(∑_{i=1}^k S_ii) / (∑_{i=1}^n S_ii) ≥ 0.99
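A small Octave sketch of this check, picking the smallest k that retains at least 99% of the variance from a single svd call:
s = diag(S);                      % the diagonal entries S_ii, in decreasing order
total = sum(s);
for k = 1:length(s)
  if sum(s(1:k)) / total >= 0.99
    break;                        % k is now the smallest value retaining 99% of the variance
  end
end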
Given a training set with a large number of features (e.g. x(1) , … , x(m) ∈ R10000 ) we can use
PCA to reduce the number of features in each example of the training set (e.g.
z (1) , … , z (m) ∈ R1000 ).
Note that we should define the PCA reduction from x^(i) to z^(i) only on the training set and not on the cross-validation or test sets. You can then apply that mapping to your cross-validation and test sets after it has been defined on the training set.
Applications
Compression: reduce the memory or disk space needed to store the data, and speed up the learning algorithm.
Visualization of data: choose k = 2 or k = 3 so the data can be plotted.
Bad use of PCA: trying to prevent overfitting. We might think that reducing the features with
PCA would be an effective way to address overfitting. It might work, but is not recommended
because it does not consider the values of our results y. Using just regularization will be at least
as effective.
Don't assume you need to do PCA. Try your full machine learning algorithm without PCA
first. Then use PCA if you find that you need it.
ML:Anomaly Detection
Problem Motivation
Just like in other learning problems, we are given a dataset x(1) , x(2) , … , x(m) .
We are then given a new example, xtest , and we want to know whether this new
example is abnormal/anomalous.
We define a "model" p(x) that tells us the probability the example is not
anomalous. We also use a threshold ϵ (epsilon) as a dividing line so we can say
which examples are anomalous and which are not.
If our anomaly detector is flagging too many examples as anomalous, then we need to decrease our threshold ϵ.
Gaussian Distribution
The Gaussian Distribution is a familiar bell-shaped curve that can be described by
a function N (μ, σ 2 )
Let x∈ℝ. If the probability distribution of x is Gaussian with mean μ, variance σ 2 ,
then:
x ∼ N (μ, σ 2 )
Mu, or μ, describes the center of the curve, called the mean. The width of the
curve is described by sigma, or σ, called the standard deviation.
p(x; μ, σ²) = (1 / (σ√(2π))) e^(−(1/2)((x − μ)/σ)²)
We can estimate the parameter μ from a given dataset by simply taking the
average of all the examples:
μ = (1/m) ∑_{i=1}^m x^(i)
We can estimate the other parameter, σ 2 , with our familiar squared error formula:
σ² = (1/m) ∑_{i=1}^m (x^(i) − μ)²
Algorithm
Given a training set of examples, {x(1) , … , x(m) } where each example is a
vector, x ∈ Rn .
The algorithm:
Calculate μ_j = (1/m) ∑_{i=1}^m x_j^(i)
Calculate σ_j² = (1/m) ∑_{i=1}^m (x_j^(i) − μ_j)²
p(x) = ∏_{j=1}^n p(x_j; μ_j, σ_j²) = ∏_{j=1}^n (1/(√(2π) σ_j)) exp(−(x_j − μ_j)² / (2σ_j²))
Flag an anomaly if p(x) < ϵ
A vectorized version of the calculation for μ is μ = (1/m) ∑_{i=1}^m x^(i). You can vectorize σ² similarly.
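A minimal Octave sketch of the whole procedure, assuming X is an m×n training matrix of non-anomalous examples, x is a new 1×n example, and epsilon is the chosen threshold:
mu = mean(X);                      % 1 x n vector of feature means
sigma2 = mean((X - mu) .^ 2);      % 1 x n vector of feature variances
% density of the new example under the product of independent Gaussians
p = prod((1 ./ sqrt(2 * pi * sigma2)) .* exp(-((x - mu) .^ 2) ./ (2 * sigma2)));
isAnomaly = (p < epsilon);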
To develop and evaluate an anomaly detection system, take some labeled data, categorized into anomalous (y = 1) and non-anomalous (y = 0) examples. Among that data, take a large proportion of the good, non-anomalous data for the training set on which to train p(x). In other words, we split the data 60/20/20 training/CV/test and then split the anomalous examples 50/50 between the CV and test sets.
Algorithm evaluation:
Precision/recall
F1 score
Use anomaly detection when...
We have a very small number of positive examples (y = 1; 0-20 examples is common) and a large number of negative (y = 0) examples.
We have many different "types" of anomalies and it is hard for any algorithm to learn
from positive examples what the anomalies look like; future anomalies may look
nothing like any of the anomalous examples we've seen so far.
Use supervised learning when...
We have a large number of both positive and negative examples. In other words,
the training set is more evenly divided into classes.
We have enough positive examples for the algorithm to get a sense of what new positive examples look like. Future positive examples are likely to be similar to the ones in the training set.
We can check that our features are gaussian by plotting a histogram of our data
and checking for the bell-shaped curve.
Some transforms we can try on an example feature x that does not have the bell-
shaped curve are:
log(x)
log(x + 1)
√x
x^(1/3)
We can play with each of these to try and achieve the gaussian shape in our data.
There is an error analysis procedure for anomaly detection that is very similar to
the one in supervised learning.
Our goal is for p(x) to be large for normal examples and small for anomalous
examples.
One common problem is when p(x) is similar for both types of examples. In this
case, you need to examine the anomalous examples that are giving high
probability in detail and try to figure out new features that will better distinguish
the data.
In general, choose features that might take on unusually large or small values in
the event of an anomaly.
Multivariate Gaussian Distribution
(Optional)
The multivariate gaussian distribution is an extension of anomaly detection and
may (or may not) catch more anomalies.
Instead of modeling p(x1 ), p(x2 ), … separately, we will model p(x) all in one go.
Our parameters will be: μ ∈ Rn and Σ ∈ Rn×n
p(x; μ, Σ) = (1 / ((2π)^(n/2) ∣Σ∣^(1/2))) exp(−(1/2)(x − μ)^T Σ^(−1) (x − μ))
The important effect is that we can model oblong gaussian contours, allowing us
to better fit data that might not fit into the normal circular contours.
Varying Σ changes the shape, width, and orientation of the contours. Changing μ
will move the center of the distribution.
Check also:
The original model for p(x) corresponds to a multivariate Gaussian where the
contours of p(x; μ, Σ) are axis-aligned.
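A short Octave sketch of evaluating the multivariate Gaussian density for a single example x (an n×1 vector), given mu (n×1) and Sigma (n×n):
n = length(mu);
d = x - mu;
p = (2 * pi) ^ (-n / 2) * det(Sigma) ^ (-1 / 2) * exp(-0.5 * d' * (Sigma \ d));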
ML:Recommender Systems
Problem Formulation
Recommendation is currently a very popular application of machine learning.
Say we are trying to recommend movies to customers. We can use the following
definitions
nu = number of users
nm = number of movies
r(i, j) = 1 if user j has rated movie i
y(i, j) = rating given by user j to movie i (defined only if r(i,j)=1)
One approach is that we could do linear regression for every single user. For each
user j, learn a parameter θ (j) ∈ R3 . Predict user j as rating movie i with
(θ(j) )T x(i) stars.
This is our familiar linear regression. The base of the first summation is choosing
all i such that r(i, j) = 1.
min_{θ^(1),…,θ^(n_u)} (1/2) ∑_{j=1}^{n_u} ∑_{i:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j))² + (λ/2) ∑_{j=1}^{n_u} ∑_{k=1}^{n} (θ_k^(j))²
We can apply our linear regression gradient descent update using the above cost
function.
The only real difference is that we eliminate the constant 1/m.
Collaborative Filtering
It can be very difficult to find features such as "amount of romance" or "amount of
action" in a movie. To figure this out, we can use feature finders.
We can let the users tell us how much they like the different genres, providing
their parameter vector immediately for us.
To infer the features from given parameters, we use the squared error function
with regularization over all the users:
min_{x^(1),…,x^(n_m)} (1/2) ∑_{i=1}^{n_m} ∑_{j:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j))² + (λ/2) ∑_{i=1}^{n_m} ∑_{k=1}^{n} (x_k^(i))²
You can also start by randomly guessing the values for theta and then alternate back and forth, estimating the features from theta and theta from the features. You will actually converge to a good set of features.
It looks very complicated, but we've only combined the cost function for theta and
the cost function for x.
Because the algorithm can learn them itself, the bias units where x0=1 have been
removed, therefore x∈ℝn and θ∈ℝn.
1. Initialize x^(1), ..., x^(n_m), θ^(1), ..., θ^(n_u) to small random values. This serves to break symmetry and ensures that the algorithm learns features x^(1), ..., x^(n_m) that are different from each other.
2. Minimize J(x^(1), ..., x^(n_m), θ^(1), ..., θ^(n_u)) using gradient descent (or an advanced optimization algorithm). E.g., for every j = 1, ..., n_u and i = 1, ..., n_m:
x_k^(i) := x_k^(i) − α( ∑_{j:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j)) θ_k^(j) + λ x_k^(i) )
θ_k^(j) := θ_k^(j) − α( ∑_{i:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j)) x_k^(i) + λ θ_k^(j) )
(A vectorized sketch of the underlying cost appears after this list.)
3. For a user with parameters θ and a movie with (learned) features x, predict a star
rating of θ T x.
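For concreteness, a vectorized Octave sketch of the combined, regularized cost, assuming Y (movies × users) holds the ratings, R is a binary matrix with R(i,j) = 1 where a rating exists, X (movies × n) holds the movie features, Theta (users × n) holds the user parameters, and lambda is the regularization parameter:
Err = (X * Theta' - Y) .* R;             % prediction errors, only where a rating exists
J = (1 / 2) * sum(sum(Err .^ 2)) ...
    + (lambda / 2) * sum(sum(X .^ 2)) ...
    + (lambda / 2) * sum(sum(Theta .^ 2));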
Predicting how similar two movies i and j are can be done using the distance
between their respective feature vectors x. Specifically, we are looking for a small
value of ∣∣x(i) − x(j) ∣∣.
Implementation Detail: Mean
Normalization
If the ranking system for movies from the previous lectures is used as-is, then new users (who have rated no movies) will receive incorrect predictions.
Specifically, they will be assigned θ with all components equal to zero due to the
minimization of the regularization term. That is, we assume that the new user will
rank all movies 0, which does not seem intuitively correct.
We rectify this problem by normalizing the data relative to the mean. First, we use
a matrix Y to store the data from previous ratings, where the ith row of Y is the
ratings for the ith movie and the jth column corresponds to the ratings for the jth
user.
μ = [μ1 , μ2 , … , μnm ]
such that
μ_i = (∑_{j:r(i,j)=1} Y_{i,j}) / (∑_j r(i,j))
Which is effectively the mean of the previous ratings for the ith movie (counting only the users who have actually rated that movie). We can now normalize the data by subtracting μ, the vector of mean ratings, from the actual ratings for each user (each column in matrix Y):
    ⎡5  5  0  0⎤        ⎡2.5 ⎤
Y = ⎢4  ?  ?  0⎥,   μ = ⎢2   ⎥
    ⎢0  0  5  4⎥        ⎢2.25⎥
    ⎣0  0  5  0⎦        ⎣1.25⎦
After learning θ and x on the mean-normalized data, we predict user j's rating of movie i by adding the mean back in: (θ^(j))^T x^(i) + μ_i
Now, for a new user, the initial predicted values will be equal to the μ term instead
of simply being initialized to zero, which is more accurate.
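A small Octave sketch of the mean-normalization step, assuming Y (movies × users) stores unrated entries as 0 and R is a binary matrix with R(i,j) = 1 where a rating exists:
mu = sum(Y .* R, 2) ./ sum(R, 2);   % mean of each movie's existing ratings
Ynorm = (Y - mu) .* R;              % subtract the mean only where a rating exists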
ML:Large Scale Machine Learning
Stochastic Gradient Descent
cost(θ, (x^(i), y^(i))) = (1/2)(h_θ(x^(i)) − y^(i))²
The only difference in the above cost function is the elimination of the m constant within the 1/2 factor: the usual 1/(2m) becomes 1/2, since the cost here is for a single example.
J_train(θ) = (1/m) ∑_{i=1}^m cost(θ, (x^(i), y^(i)))
Jtrain is now just the average of the cost applied to all of our training
examples.
The stochastic gradient descent algorithm:
1. Randomly shuffle the dataset.
2. For i = 1…m:
Θ_j := Θ_j − α(h_Θ(x^(i)) − y^(i)) ⋅ x_j^(i)   (for every j = 0, …, n)
This algorithm will only try to fit one training example at a time. This way we
can make progress in gradient descent without having to scan all m training
examples first. Stochastic gradient descent will be unlikely to converge at
the global minimum and will instead wander around it randomly, but
usually yields a result that is close enough. Stochastic gradient descent will
usually take 1-10 passes through your data set to get near the global
minimum.
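A minimal Octave sketch of stochastic gradient descent for linear regression, assuming X is m×(n+1) with a leading column of ones, y is m×1, theta is (n+1)×1, and alpha is the learning rate:
num_passes = 3;                     % typically 1-10 passes over the data
m = size(X, 1);
for pass = 1:num_passes
  order = randperm(m);              % step 1: randomly shuffle the examples
  for i = order                     % step 2: update theta one example at a time
    h = X(i, :) * theta;            % prediction for example i
    theta = theta - alpha * (h - y(i)) * X(i, :)';
  end
end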
Mini-Batch Gradient Descent
Mini-batch gradient descent uses some in-between number of examples b per update (say b = 10). Repeat, for i = 1, 11, 21, …:
θ_j := θ_j − α (1/10) ∑_{k=i}^{i+9} (h_θ(x^(k)) − y^(k)) x_j^(k)
We're simply summing over ten examples at a time. The advantage of
computing more than one example at a time is that we can use vectorized
implementations over the b examples.
To check that stochastic gradient descent is converging, one strategy is to plot the average cost of the hypothesis applied to the last 1000 or so training examples. We can compute and save these costs during the gradient descent iterations.
With a smaller learning rate, it is possible that you may get a slightly better
solution with stochastic gradient descent. That is because stochastic
gradient descent will oscillate and jump around the global minimum, and it
will make smaller random jumps with a smaller learning rate.
If you increase the number of examples you average over to plot the
performance of your algorithm, the plot's line will become smoother.
With a very small number of examples for the average, the line will be too
noisy and it will be difficult to find the trend.
One way to actually converge at the global minimum is to slowly decrease the learning rate α over time. However, this is not often done because people don't want to have to fiddle with even more parameters.
Online Learning
With a continuous stream of users to a website, we can run an endless loop
that gets (x,y), where we collect some user actions for the features in x to
predict some behavior y.
You can update θ for each individual (x,y) pair as you collect them. This way,
you can adapt to new pools of users, since you are continuously updating
theta.
Map Reduce and Data Parallelism
You can split your training set into z subsets corresponding to the number of machines you have. On each of those machines calculate
∑_{i=p}^{q} (h_θ(x^(i)) − y^(i)) ⋅ x_j^(i)
where we've split the data starting at p and ending at q.
MapReduce will take all these dispatched (or 'mapped') jobs and 'reduce' them by calculating:
Θ_j := Θ_j − α (1/z) (temp_j^(1) + temp_j^(2) + ⋯ + temp_j^(z))
For all j = 0, … , n.
This is simply taking the computed cost from all the machines, calculating
their average, multiplying by the learning rate, and updating theta.
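An Octave sketch of what one machine's 'map' job might look like, under the assumption that its slice of the data runs from row p to row q of X (with labels y) and theta holds the current parameters:
temp = zeros(size(theta));              % this machine's partial gradient sum
for i = p:q
  temp = temp + (X(i, :) * theta - y(i)) * X(i, :)';
end
% the 'reduce' step then averages the temp vectors from all z machines and updates theta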
For neural networks, you can compute forward propagation and back
propagation on subsets of your data on many machines. Those machines
can report their derivatives back to a 'master' server that will combine them.