Deep Learning Andrew NG
DeepLearning.ai contains five courses which can be taken on Coursera. The five course titles
are:
Neural Networks and Deep Learning
Table of contents
Course summary
Introduction to deep learning
What is a (Neural Network) NN?
Supervised learning with neural networks
Why is deep learning taking off?
Neural Networks Basics
Binary classification
Logistic regression
Logistic regression cost function
Gradient Descent
Derivatives
More Derivatives examples
Computation graph
Derivatives with a Computation Graph
Logistic Regression Gradient Descent
Gradient Descent on m Examples
Vectorization
Vectorizing Logistic Regression
Notes on Python and NumPy
General Notes
Shallow neural networks
Neural Networks Overview
Neural Network Representation
Computing a Neural Network's Output
Vectorizing across multiple examples
Activation functions
Why do you need non-linear activation functions?
Derivatives of activation functions
Gradient descent for Neural Networks
Random Initialization
Deep Neural Networks
Deep L-layer neural network
Forward Propagation in a Deep Network
Getting your matrix dimensions right
Why deep representations?
Building blocks of deep neural networks
Forward and Backward Propagation
Parameters vs Hyperparameters
What does this have to do with the brain?
Extra: Ian Goodfellow interview
Course summary
Here is the course summary as given on the course page:
If you want to break into cutting-edge AI, this course will help you do so. Deep learning
engineers are highly sought after, and mastering deep learning will give you numerous new
career opportunities. Deep learning is also a new "superpower" that will let you build AI systems
that just weren't possible a few years ago.
In this course, you will learn the foundations of deep learning. When you finish this class, you
will:
This course also teaches you how Deep Learning actually works, rather than presenting only a
cursory or surface-level description. So after completing it, you will be able to apply deep
learning to your own applications. If you are looking for a job in AI, after this course you will
also be able to answer basic interview questions.
Basically, in a perceptron a single neuron calculates a weighted sum of the input ( W.T*X ) and
then we set a threshold to predict the output. If the weighted sum of the input crosses the
threshold the perceptron fires, and if not the perceptron doesn't.
A disadvantage of the perceptron is that it only outputs binary values, and if we make a small
change in a weight or bias the perceptron can flip its output. We need a system that can modify
the output slightly according to small changes in weights and biases. Here the sigmoid function
comes into the picture.
If we replace the perceptron's threshold with a sigmoid function, then we can make slight
changes in the output.
e.g. output of a perceptron = 0; you slightly change a weight and bias and the output flips to 1,
while something in between (like 0.7) would be more accurate. In the case of sigmoid:
output = 0, slight change in weight and bias, output = 0.7.
If we apply a sigmoid activation function, then a single neuron will act as logistic regression.
We can understand the difference between the perceptron and the sigmoid function by looking
at the sigmoid function graph.
Simple NN graph:
Image taken from tutorialspoint.com
RELU stands for rectified linear unit. It is the most popular activation function right now and
makes deep NNs train faster.
Hidden layers predict connections between inputs automatically; that's what deep learning is
good at.
Supervised learning means we have the (X,Y) and we need to get the function that maps X to Y.
i. Data:
Using this image we can conclude:
For small data a NN can perform like Linear regression or SVM (Support vector machine).
For big data a small NN is better than SVM.
For big data a big NN is better than a medium NN, which is better than a small NN.
Hopefully we have a lot of data, because the world is using computers a little bit
more:
Mobiles
IOT (Internet of things)
ii. Computation:
GPUs.
Powerful CPUs.
Distributed computing.
ASICs
iii. Algorithm:
a. Creative algorithms have appeared that changed the way NNs work.
For example using RELU function is so much better than using SIGMOID function in
training a NN because it helps with the vanishing gradient problem.
Binary classification
Mainly he is talking about how to do a logistic regression to make a binary classifier.
Logistic regression
This algorithm is used for classification with 2 classes.
Equations:
Simple equation: y = wx + b
If x is a vector: y = w(transpose)x + b
If we need y to be in between 0 and 1 (probability): y = sigmoid(w(transpose)x + b)
In some notations this might be used: y = sigmoid(w(transpose)x)
Here b is w0 of w and we add x0 = 1 . But we won't use this notation in the course
(Andrew said that the first notation is better).
In binary classification Y has to be between 0 and 1 .
In the last equation w is a vector of size Nx and b is a real number.
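As a quick sketch, the last equation can be written in NumPy like this (the weights, bias, and input below are made-up illustrative values, not from the course):

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

def predict_proba(w, b, x):
    # w: (nx, 1) weight vector, b: scalar bias, x: (nx, 1) input
    z = np.dot(w.T, x) + b   # z = w^T x + b, shape (1, 1)
    return sigmoid(z)        # probability that y = 1

# Made-up example values
w = np.array([[0.5], [-0.25]])
b = 0.1
x = np.array([[2.0], [4.0]])
p = predict_proba(w, b, x)   # z = 0.1, so p is a bit above 0.5
```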
Gradient Descent
We want to predict w and b that minimize the cost function.
Our cost function is convex.
First we initialize w and b to 0,0 or initialize them to random values in the convex function, and
then try to improve the values to reach the minimum value.
In logistic regression people always use 0,0 instead of random (the cost function is convex, so
it is safe).
The gradient descent algorithm repeats: w = w - alpha * dw where alpha is the learning rate and
dw is the derivative of the cost with respect to w (the change to w ). The derivative is also the
slope of the cost in the w direction.
It looks like a greedy algorithm: the derivative gives us the direction in which to improve our
parameters.
b = b - alpha * d(J(w,b)) / db (how much the function slopes in the b direction)
Derivatives
We will talk about some of the required calculus.
You don't need to be a calculus geek to master deep learning but you'll need some skills from it.
Derivative of a linear line is its slope.
ex. f(a) = 3a then d(f(a))/d(a) = 3
if a = 2 then f(a) = 6
if we move a a little bit, a = 2.001, then f(a) = 6.003, which means we multiplied the
derivative (slope) by the amount we moved and added it to the last result.
To conclude: the derivative is the slope, and the slope is different at different points of the
function. That's why the derivative is itself a function.
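The "nudge a a little bit" idea above can be checked numerically with a tiny sketch:

```python
# Numerically check that the slope of f(a) = 3a is 3, using the
# "move a a little bit" idea from the text.
def f(a):
    return 3 * a

a = 2.0
eps = 0.001
slope = (f(a + eps) - f(a)) / eps   # (6.003 - 6) / 0.001 = 3
```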
Computation graph
It's a graph that organizes the computation from left to right.
X1 Feature
X2 Feature
W1 Weight of the first feature.
W2 Weight of the second feature.
B Logistic Regression parameter.
M Number of training examples
Y(i) Expected output of example i
So we have:
Then from right to left we will calculate derivations compared to the result:
From the above we can conclude the logistic regression pseudo code:
# Backward pass
dz(i) = a(i) - Y(i)
dw1 += dz(i) * x1(i)
dw2 += dz(i) * x2(i)
db += dz(i)
J /= m
dw1/= m
dw2/= m
db/= m
# Gradient descent
w1 = w1 - alpha * dw1
w2 = w2 - alpha * dw2
b = b - alpha * db
The above code should run for some iterations to minimize error.
Vectorization is very important in deep learning to reduce loops. In the last code we can make the
whole loop one step using vectorization!
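As a hedged sketch of that vectorization (the function name gradient_step is mine, not from the course), one gradient-descent step over all m examples with no explicit loop over features or examples could look like:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    # X: (nx, m) inputs, Y: (1, m) labels, w: (nx, 1), b: scalar
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # forward pass for all m examples at once
    dZ = A - Y                        # (1, m)
    dw = np.dot(X, dZ.T) / m          # (nx, 1) -- replaces the dw1/dw2 loop
    db = np.sum(dZ) / m
    return w - alpha * dw, b - alpha * db
```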
Vectorization
Deep learning shines when the datasets are big. However, for loops will make you wait a long
time for a result. That's why we need vectorization to get rid of some of our for loops.
NumPy library (dot) function is using vectorization by default.
The vectorization can be done on CPU or GPU through SIMD operations, but it's faster on
GPU.
Whenever possible avoid for loops.
Most of the NumPy library methods are vectorized version.
As an input we have a matrix X and its [Nx, m] and a matrix Y and its [Ny, m] .
We will then compute in one step [z1,z2...zm] = W' * X + [b,b,...b] . This can be written in
Python as:
In NumPy, obj.reshape(1,4) changes the shape of the array without copying its values.
Reshape is cheap in computation, so use it whenever you're not sure about the dimensions.
Broadcasting works when you do a matrix operation with matrices whose shapes don't match for
the operation; in this case NumPy automatically makes the shapes ready for the operation by
broadcasting the values.
The general principle of broadcasting: if you have an (m,n) matrix and you add (+), subtract (-),
multiply (*) or divide (/) by a (1,n) matrix, then this will copy it m times into an (m,n) matrix. The
same if you use those operations with an (m,1) matrix: it is copied n times into an (m,n)
matrix. Then the addition, subtraction, multiplication, or division is applied element-wise.
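A small sketch of those broadcasting rules (the matrices are made-up examples):

```python
import numpy as np

# An (m, n) matrix combined with a (1, n) row or an (m, 1) column:
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])        # shape (2, 3)
row = np.array([[10., 20., 30.]])   # shape (1, 3)
col = np.array([[100.], [200.]])    # shape (2, 1)

B = A + row   # the row is "copied" down the 2 rows, then added element-wise
C = A * col   # the column is "copied" across the 3 columns, then multiplied
```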
Jupyter / IPython notebooks are useful Python tools that make it easy to integrate code
and documentation at the same time. They run in the browser and don't need an IDE.
To open Jupyter Notebook, open the command line and call: jupyter-notebook It should be
installed to work.
s = sigmoid(x)
ds = s * (1 - s) # derivative using calculus
General Notes
The main steps for building a Neural Network are:
Define the model structure (such as number of input features and outputs)
Initialize the model's parameters.
Loop.
Calculate current loss (forward propagation)
Calculate current gradient (backward propagation)
Update parameters (gradient descent)
Preprocessing the dataset is important.
Tuning the learning rate (which is an example of a "hyperparameter") can make a big difference
to the algorithm.
kaggle.com is a good place for datasets and competitions.
Pieter Abbeel is one of the best in deep reinforcement learning.
X1 \
X2 ==> z = XW + B ==> a = Sigmoid(z) ==> l(a,Y)
X3 /
X1 \
X2 => z1 = XW1 + B1 => a1 = Sigmoid(z1) => z2 = a1W2 + B2 => a2 = Sigmoid(z2) => l(a2,Y)
X3 /
X is the input vector (X1, X2, X3) , and Y is the output variable (1x1)
We are talking about a 2-layer NN. The input layer isn't counted.
Nx = 3
for i = 1 to m
z[1, i] = W1*x[i] + b1 # shape of z[1, i] is (noOfHiddenNeurons,1)
a[1, i] = sigmoid(z[1, i]) # shape of a[1, i] is (noOfHiddenNeurons,1)
z[2, i] = W2*a[1, i] + b2 # shape of z[2, i] is (1,1)
a[2, i] = sigmoid(z[2, i]) # shape of a[2, i] is (1,1)
In the last example we can call X = A0 . So the previous step can be rewritten as:
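A vectorized sketch of those two layers, stacking all m examples as columns (the sizes and names below are made-up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical sizes: 3 input features, 4 hidden units, 5 examples.
nx, nh, m = 3, 4, 5
rng = np.random.default_rng(0)
A0 = rng.standard_normal((nx, m))         # A0 = X, examples as columns
W1 = rng.standard_normal((nh, nx)) * 0.01
b1 = np.zeros((nh, 1))
W2 = rng.standard_normal((1, nh)) * 0.01
b2 = np.zeros((1, 1))

Z1 = W1 @ A0 + b1   # (nh, m); b1 is broadcast across the m columns
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2   # (1, m)
A2 = sigmoid(Z2)    # one prediction per example
```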
Activation functions
So far we have been using sigmoid, but in some cases other functions can be a lot better.
Sigmoid can lead to slow gradient descent where the updates are very small.
Sigmoid activation function range is (0,1): A = 1 / (1 + np.exp(-z)) # Where z is the input
matrix
In NumPy we can implement tanh as: A = (np.exp(z) - np.exp(-z)) / (np.exp(z) +
np.exp(-z)) # Where z is the input matrix (or simply np.tanh(z))
It turns out that the tanh activation usually works better than sigmoid activation function for
hidden units because the mean of its output is closer to zero, and so it centers the data better for
the next layer.
A disadvantage of the sigmoid or tanh function is that if the input is very small or very large, the
slope will be near zero, which causes the slow gradient descent problem.
One of the popular activation functions that solved the slow gradient descent is the RELU
function. RELU = max(0,z) # so if z is negative the slope is 0 and if z is positive the
slope remains linear.
So here is a basic rule for choosing activation functions: if your output is a classification between
0 and 1, use sigmoid for the output activation and RELU for the others.
Leaky RELU differs from RELU in that if the input is negative the slope is small instead of zero.
It works like RELU but most people use RELU. Leaky_RELU = max(0.01*z, z) # the 0.01 can
be a parameter for your algorithm.
g(z) = 1 / (1 + np.exp(-z))
g'(z) = (1 / (1 + np.exp(-z))) * (1 - (1 / (1 + np.exp(-z))))
g'(z) = g(z) * (1 - g(z))
g(z) = np.maximum(0,z)
g'(z) = { 0 if z < 0
1 if z >= 0 }
g(z) = np.maximum(0.01 * z, z)
g'(z) = { 0.01 if z < 0
1 if z >= 0 }
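Those activation functions and their derivatives can be collected into a small NumPy sketch (the helper names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1 - g)            # g'(z) = g(z) * (1 - g(z))

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2    # derivative of tanh

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    # slope 0 for z < 0, slope 1 for z >= 0
    return (np.asarray(z) >= 0).astype(float)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)
```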
NN parameters:
n[0] = Nx
n[1] = NoOfHiddenNeurons
n[2] = NoOfOutputNeurons = 1
W1 shape is (n[1],n[0])
b1 shape is (n[1],1)
W2 shape is (n[2],n[1])
b2 shape is (n[2],1)
Repeat:
Compute predictions (y'[i], i = 0,...m)
Get derivatives: dW1, db1, dW2, db2
Update: W1 = W1 - LearningRate * dW1
b1 = b1 - LearningRate * db1
W2 = W2 - LearningRate * dW2
b2 = b2 - LearningRate * db2
Forward propagation:
Z1 = W1A0 + b1 # A0 is X
A1 = g1(Z1)
Z2 = W2A1 + b2
A2 = Sigmoid(Z2) # Sigmoid because the output is between 0 and 1
Backpropagation (derivations):
Random Initialization
In logistic regression it wasn't important to initialize the weights randomly, while in NN we have
to initialize them randomly.
If we initialize all the weights with zeros in NN it won't work (initializing bias with zero is OK):
all hidden units will be completely identical (symmetric) and compute exactly the same function;
on each gradient descent iteration all the hidden units will always update in the same way.
We need small values because in sigmoid (or tanh), for example, if the weight is too large you
are more likely to end up even at the very start of training with very large values of Z. Which
causes your tanh or your sigmoid activation function to be saturated, thus slowing down
learning. If you don't have any sigmoid or tanh activation functions throughout your neural
network, this is less of an issue.
The constant 0.01 is alright for 1-hidden-layer networks, but if the NN is deep this number can be
changed; it should always stay a small number.
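A minimal initialization sketch along those lines (the layer sizes are made-up):

```python
import numpy as np

# Small random weights break symmetry; zeros for the biases are fine.
n0, n1, n2 = 3, 4, 1   # hypothetical layer sizes
rng = np.random.default_rng(1)
W1 = rng.standard_normal((n1, n0)) * 0.01   # small so tanh/sigmoid don't saturate
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))
```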
n[0] denotes the number of neurons in the input layer. n[L] denotes the number of neurons in
the output layer.
g[l] is the activation function.
a[l] = g[l](z[l])
These are the notations we will use for deep neural networks.
So we have:
A vector n of shape (1, NoOfLayers+1)
A vector g of shape (1, NoOfLayers)
A list of different shapes w based on the number of neurons on the previous and the
current layer.
A list of different shapes b based on the number of neurons on the current layer.
We can't compute the forward propagation of all the layers without a for loop, so it's OK to have
a for loop here.
The dimensions of the matrices are very important; you need to figure them out.
When starting on an application, don't start directly with dozens of hidden layers. Try the
simplest solution (e.g. logistic regression), then a shallow neural network, and so on.
Input A[l-1]
Z[l] = W[l]A[l-1] + b[l]
A[l] = g[l](Z[l])
Output A[l], cache(Z[l])
Parameters vs Hyperparameters
The main parameters of the NN are W and b .
Hyperparameters (parameters that control the algorithm) are like:
Learning rate.
Number of iterations.
Number of hidden layers L .
Number of hidden units n .
Choice of activation functions.
You have to try hyperparameter values yourself.
In the earlier days of DL and ML the learning rate was often called a parameter, but it really is
(and now everybody calls it) a hyperparameter.
In the next course we will see how to optimize hyperparameters.
Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Table of contents
Course summary
Practical aspects of Deep Learning
Train / Dev / Test sets
Bias / Variance
Basic Recipe for Machine Learning
Regularization
Why regularization reduces overfitting?
Dropout Regularization
Understanding Dropout
Other regularization methods
Normalizing inputs
Vanishing / Exploding gradients
Weight Initialization for Deep Networks
Numerical approximation of gradients
Gradient checking implementation notes
Initialization summary
Regularization summary
Optimization algorithms
Mini-batch gradient descent
Understanding mini-batch gradient descent
Exponentially weighted averages
Understanding exponentially weighted averages
Bias correction in exponentially weighted averages
Gradient descent with momentum
RMSprop
Adam optimization algorithm
Learning rate decay
The problem of local optima
Hyperparameter tuning, Batch Normalization and Programming Frameworks
Tuning process
Using an appropriate scale to pick hyperparameters
Hyperparameters tuning in practice: Pandas vs. Caviar
Normalizing activations in a network
Fitting Batch Normalization into a neural network
Why does Batch normalization work?
Batch normalization at test time
Softmax Regression
Training a Softmax classifier
Deep learning frameworks
TensorFlow
Extra Notes
Course summary
Here is the course summary as given on the course page:
This course will teach you the "magic" of getting deep learning to work well. Rather than the
deep learning process being a black box, you will understand what drives performance, and be
able to more systematically get good results. You will also learn TensorFlow.
Bias / Variance
Bias / Variance techniques are Easy to learn, but difficult to master.
So here is the explanation of Bias / Variance:
If your model is underfitting (e.g. logistic regression on non-linear data) it has a "high bias"
If your model is overfitting then it has a "high variance"
Your model will be alright if you balance the Bias / Variance
For more:
Another idea to get the bias / variance if you don't have a 2D plotting mechanism:
High variance (overfitting) for example:
Training error: 1%
Dev error: 11%
high Bias (underfitting) for example:
Training error: 15%
Dev error: 14%
high Bias (underfitting) && High variance (overfitting) for example:
Training error: 15%
Test error: 30%
Best:
Training error: 0.5%
Test error: 1%
These assumptions come from the case where a human has 0% error. If your problem isn't like
that, you'll need to use human error as the baseline.
Regularization
Adding regularization to NN will help it reduce variance (overfitting)
L1 matrix norm:
||W|| = Sum(|w[i,j]|) # sum of absolute values of all w
The L2 matrix norm, for arcane technical math reasons, is called the Frobenius norm:
||W||^2 = Sum(|w[i,j]|^2) # sum of all w squared
The L1 regularization version makes a lot of w values become zeros, which makes the model
size smaller.
L2 regularization is being used much more often.
lambda here is the regularization parameter (hyperparameter)
We stack the matrix as one vector (m*n, 1) and then apply sqrt(w1^2 + w2^2 + ...)
In practice this penalizes large weights and effectively limits the freedom in your model.
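A quick NumPy check of those two norms on a toy matrix:

```python
import numpy as np

W = np.array([[1., -2.],
              [3.,  4.]])

l1 = np.abs(W).sum()       # L1 norm: sum of absolute values = 1+2+3+4 = 10
fro_sq = (W ** 2).sum()    # squared Frobenius norm = 1+4+9+16 = 30
# Same thing by stacking the matrix into one vector and taking sqrt of the
# sum of squares; np.linalg.norm defaults to the Frobenius norm for matrices:
fro = np.linalg.norm(W)    # sqrt(30)
```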
Intuition 1:
If lambda is too large - a lot of w's will be close to zeros which will make the NN simpler
(you can think of it as it would behave closer to logistic regression).
If lambda is chosen well, it will just reduce the weights that make the neural network
overfit.
Intuition 2 (with tanh activation function):
If lambda is too large, w's will be small (close to zero) - will use the linear part of the tanh
activation function, so we will go from non linear activation to roughly linear which would
make the NN a roughly linear classifier.
If lambda is chosen well, it will just make some of the tanh activations roughly linear, which will
prevent overfitting.
Implementation tip: if you implement gradient descent, one of the steps to debug it is to plot the
cost function J as a function of the number of iterations and check that J decreases
monotonically after every iteration of gradient descent with regularization. If you plot the old
definition of J (no regularization) then you might not see it decrease monotonically.
Dropout Regularization
In most cases Andrew Ng says that he uses L2 regularization.
Vector d[l] is used for forward and back propagation and is the same for them, but it is different
for each iteration (pass) or training example.
At test time we don't use dropout. If you implement dropout at test time - it would add noise to
predictions.
Understanding Dropout
In the previous video, the intuition was that dropout randomly knocks out units in your network.
So it's as if on every iteration you're working with a smaller NN, and so using a smaller NN seems
like it should have a regularizing effect.
Another intuition: can't rely on any one feature, so have to spread out weights.
It's possible to show that dropout has a similar effect to L2 regularization.
Dropout can have different keep_prob per layer.
The input layer dropout has to be near 1 (or 1 - no dropout) because you don't want to eliminate
a lot of features.
If you're more worried about some layers overfitting than others, you can set a lower keep_prob
for some layers than others. The downside is, this gives you even more hyperparameters to
search for using cross-validation. One other alternative might be to have some layers where you
apply dropout and some layers where you don't apply dropout and then just have one
hyperparameter, which is a keep_prob for the layers for which you do apply dropouts.
A lot of researchers are using dropout with Computer Vision (CV) because they have a very big
input size and almost never have enough data, so overfitting is the usual problem. And dropout
is a regularization technique to prevent overfitting.
A downside of dropout is that the cost function J is not well defined and it will be hard to debug
(plot J by iteration).
To solve that you'll need to turn off dropout, set all the keep_prob s to 1, and then run the
code and check that it monotonically decreases J and then turn on the dropouts again.
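A sketch of inverted dropout for one layer, assuming a keep_prob of 0.8 (the helper name dropout_forward is mine):

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    # Inverted dropout: zero out some units, then scale the survivors up by
    # 1/keep_prob so the expected value of the activations is unchanged.
    D = rng.random(A.shape) < keep_prob   # mask; reuse the same D in backprop
    A = A * D / keep_prob
    return A, D

rng = np.random.default_rng(0)
A1 = np.ones((4, 1000))                   # toy activations
A1_drop, D1 = dropout_forward(A1, keep_prob=0.8, rng=rng)
```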
Normalizing inputs
If you normalize your inputs this will speed up the training process a lot.
Normalization goes through these steps:
i. Get the mean of the training set: mean = (1/m) * sum(x(i))
ii. Subtract the mean from each input: X = X - mean
This makes your inputs centered around 0.
iii. Get the variance of the (now centered) training set: variance = (1/m) * sum(x(i)^2)
iv. Normalize the variance: X /= np.sqrt(variance)
These steps should be applied to training, dev, and test sets (but using the mean and variance of
the train set).
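The steps above can be sketched as follows (using made-up data; note the division by the standard deviation, i.e. the square root of the variance):

```python
import numpy as np

# X: (nx, m) -- features in rows, examples in columns (made-up data).
rng = np.random.default_rng(0)
X_train = rng.standard_normal((2, 100)) * 5 + 3

mean = X_train.mean(axis=1, keepdims=True)        # (nx, 1)
X_train = X_train - mean                          # center around 0
variance = (X_train ** 2).mean(axis=1, keepdims=True)
X_train = X_train / np.sqrt(variance)             # unit variance per feature
# Reuse the SAME mean/variance when normalizing the dev and test sets.
```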
Why normalize?
If we don't normalize the inputs our cost function will be deep and its shape will be
inconsistent (elongated) then optimizing it will take a long time.
But if we normalize it the opposite will occur. The shape of the cost function will be
consistent (look more symmetric like circle in 2D example) and we can use a larger learning
rate alpha - the optimization will be faster.
Then:
Y' = W[L]W[L-1].....W[2]W[1]X
if W[l] = [1.5   0]
          [0   1.5]    (l != L because of different dimensions in the output layer)
Y' = W[L] [1.5   0]^(L-1) X  # the values grow like 1.5^(L-1), i.e. very large
          [0   1.5]
if W[l] = [0.5   0]
          [0   0.5]
Y' = W[L] [0.5   0]^(L-1) X  # the values shrink like 0.5^(L-1), i.e. very small
          [0   0.5]
The last example explains that the activations (and similarly derivatives) will be
decreased/increased exponentially as a function of number of layers.
So If W > I (Identity matrix) the activation and gradients will explode.
And If W < I (Identity matrix) the activation and gradients will vanish.
Recently Microsoft trained a 152-layer NN (ResNet)! which is a really big number. With such a deep
neural network, if your activations or gradients increase or decrease exponentially as a function
of L, then these values could get really big or really small. And this makes training difficult,
especially if your gradients become exponentially small with L; then gradient descent will take
tiny little steps and it will take a long time to learn anything.
There is a partial solution that doesn't completely solve this problem but it helps a lot - careful
choice of how you initialize the weights (next video).
np.random.randn(shape) * np.sqrt(1/n[l-1]) # Xavier initialization (works well with tanh)
np.random.randn(shape) * np.sqrt(2/n[l-1]) # He initialization (works well with ReLU)
The number 1 or 2 in the numerator can also be a hyperparameter to tune (but not the first to
start with).
This is one of the best partial solutions to vanishing / exploding gradients (ReLU + weight
initialization with controlled variance), which helps gradients not vanish/explode too quickly.
The sqrt(2/n[l-1]) variant is called "He initialization" and was published in a 2015 paper; the
sqrt(1/n[l-1]) variant is called "Xavier initialization".
Gradient checking approximates the gradients and is very helpful for finding the errors in your
backpropagation implementation but it's slower than gradient descent (so use only for
debugging).
Implementation of this is very simple.
Gradient checking:
First take W[1],b[1],...,W[L],b[L] and reshape into one big vector ( theta )
The cost function will be J(theta)
Then take dW[1],db[1],...,dW[L],db[L] into one big vector ( d_theta )
Algorithm:
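A minimal sketch of that algorithm (the helper name grad_check and the toy cost are mine): for each component i, nudge theta[i] by +/- eps, approximate the gradient with a two-sided difference, and compare it to d_theta with a relative Euclidean distance:

```python
import numpy as np

def grad_check(J, theta, d_theta, eps=1e-7):
    # Compare the analytic gradient d_theta to a two-sided numeric estimate.
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(d_theta - approx)
    return num / (np.linalg.norm(d_theta) + np.linalg.norm(approx))

# Toy check: J(theta) = sum(theta^2), so dJ/dtheta = 2*theta.
theta = np.array([1.0, -2.0, 3.0])
err = grad_check(lambda t: (t ** 2).sum(), theta, 2 * theta)
# err should be tiny (~1e-7 or less) if the gradient is correct.
```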
Initialization summary
The weights W[l] should be initialized randomly to break symmetry
It is however okay to initialize the biases b[l] to zeros. Symmetry is still broken so long as W[l] is
initialized randomly
Random initialization is used to break symmetry and make sure different hidden units can learn
different things
Regularization summary
1. L2 Regularization
Observations:
The value of λ is a hyperparameter that you can tune using a dev set.
L2 regularization makes your decision boundary smoother. If λ is too large, it is also possible to
"oversmooth", resulting in a model with high bias.
L2-regularization relies on the assumption that a model with small weights is simpler than a
model with large weights. Thus, by penalizing the square values of the weights in the cost
function you drive all the weights to smaller values. It becomes too costly for the cost to have
large weights! This leads to a smoother model in which the output changes more slowly as the
input changes.
What you should remember:
Implications of L2-regularization on:
cost computation:
A regularization term is added to the cost
backpropagation function:
There are extra terms in the gradients with respect to weight matrices
weights:
weights end up smaller ("weight decay") - are pushed to smaller values.
2. Dropout
Optimization algorithms
...
X{bs} = ...
Mini-batch size:
( mini batch size = m ) ==> Batch gradient descent
( mini batch size = 1 ) ==> Stochastic gradient descent (SGD)
( mini batch size = between 1 and m ) ==> Mini-batch gradient descent
Batch gradient descent:
too long per iteration (epoch)
Stochastic gradient descent:
too noisy regarding cost minimization (can be reduced by using smaller learning rate)
won't ever converge (reach the minimum cost)
lose speedup from vectorization
Mini-batch gradient descent:
i. faster learning:
you have the vectorization advantage
make progress without waiting to process the entire training set
ii. doesn't always exactly converge (oscillates in a very small region, but you can reduce the
learning rate)
Guidelines for choosing mini-batch size:
i. If small training set (< 2000 examples) - use batch gradient descent.
ii. It has to be a power of 2 (because of the way computer memory is laid out and accessed,
sometimes your code runs faster if your mini-batch size is a power of 2): 64, 128, 256, 512,
1024, ...
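Those guidelines can be sketched as a helper that shuffles the examples and slices them into mini-batches (the helper name is mine):

```python
import numpy as np

def make_mini_batches(X, Y, batch_size, rng):
    # X: (nx, m), Y: (1, m). Shuffle the columns, then slice into batches;
    # the last batch may be smaller than batch_size.
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, batch_size):
        batches.append((X[:, start:start + batch_size],
                        Y[:, start:start + batch_size]))
    return batches

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 150))
Y = (rng.random((1, 150)) > 0.5).astype(float)
batches = make_mini_batches(X, Y, 64, rng)   # batch sizes 64, 64, 22
```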
t(1) = 40
t(2) = 49
t(3) = 45
...
t(180) = 60
...
This data is low in winter and high in summer. If we plot it we will find it somewhat noisy.
Now lets compute the Exponentially weighted averages:
V0 = 0
V1 = 0.9 * V0 + 0.1 * t(1) = 4 # 0.9 and 0.1 are hyperparameters
V2 = 0.9 * V1 + 0.1 * t(2) = 8.5
V3 = 0.9 * V2 + 0.1 * t(3) = 12.15
...
General equation
Best beta average for our case is between 0.9 and 0.98
Intuition: The reason why exponentially weighted averages are useful for further optimizing the
gradient descent algorithm is that they can give different weights to recent data points ( theta )
based on the value of beta . If beta is high (around 0.9), it smooths out the averages of skewed
data points (oscillations, in gradient descent terminology). So this reduces oscillations in
gradient descent and hence gives a faster and smoother path towards the minimum.
Another imagery example:
We can implement this algorithm with more accurate results using a moving window. But the
code is more efficient and faster using the exponentially weighted averages algorithm.
Algorithm is very simple:
v = 0
Repeat
{
Get theta(t)
v = beta * v + (1-beta) * theta(t)
}
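The algorithm above, applied to the temperature example, can be sketched as (the helper name ewa is mine):

```python
def ewa(data, beta):
    # v(t) = beta * v(t-1) + (1 - beta) * theta(t), starting from v = 0
    v = 0.0
    out = []
    for theta in data:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

temps = [40, 49, 45]
smoothed = ewa(temps, beta=0.9)   # 4.0, 8.5, 12.15 as in the example above
```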
vdW = 0, vdb = 0
on iteration t:
# can be mini-batch or batch gradient descent
compute dw, db on current mini-batch
Momentum helps the cost function reach the minimum point in a faster and more consistent
way.
beta is another hyperparameter . beta = 0.9 is very common and works very well in most
cases.
In practice people don't bother implementing bias correction.
RMSprop
Stands for Root mean square prop.
This algorithm speeds up the gradient descent.
Pseudo code:
sdW = 0, sdb = 0
on iteration t:
# can be mini-batch or batch gradient descent
compute dw, db on current mini-batch
RMSprop will make the cost function move slower on the vertical direction and faster on the
horizontal direction in the following example:
Ensure that sdW is not zero by adding a small value epsilon (e.g. epsilon = 10^-8 ) to it:
W = W - learning_rate * dW / (sqrt(sdW) + epsilon)
vdW = 0, vdb = 0
sdW = 0, sdb = 0
on iteration t:
# can be mini-batch or batch gradient descent
compute dw, db on current mini-batch
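Adam combines the momentum average vdW with the RMSprop average sdW, plus bias correction. A minimal single-parameter sketch with the commonly used defaults (the function name and the toy gradient below are mine, for illustration):

```python
import numpy as np

def adam_update(w, dw, v, s, t, alpha=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw        # momentum-style average of dw
    s = beta2 * s + (1 - beta2) * dw ** 2   # RMSprop-style average of dw^2
    v_hat = v / (1 - beta1 ** t)            # bias correction (t = step count)
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Toy use: minimize J(w) = w^2, whose gradient is 2w.
w = np.array([1.0]); v = np.zeros(1); s = np.zeros(1)
for t in range(1, 101):
    dw = 2 * w
    w, v, s = adam_update(w, dw, v, s, t, alpha=0.05)
```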
Some people perform learning rate decay discretely - repeatedly decrease after some number of
epochs.
Some people are making changes to the learning rate manually.
decay_rate is another hyperparameter .
beta[1] , gamma[1] , ..., beta[L] , gamma[L] are updated using any optimization algorithms
(like GD, RMSprop, Adam)
If you are using a deep learning framework, you won't have to implement batch norm yourself:
Ex. in TensorFlow you can add this line: tf.nn.batch_normalization()
Batch normalization is usually applied with mini-batches.
If we are using batch normalization, the parameters b[1] , ..., b[L] don't count because they
will be eliminated in the mean subtraction step, so:
beta[l] - (n[l], 1)
gamma[l] - (n[l], 1)
Softmax Regression
In every example we have used so far we were talking about binary classification.
There is a generalization of logistic regression called Softmax regression that is used for
multiclass classification.
For example if we are classifying by classes dog , cat , baby chick and none of that
Dog class = 1
Cat class = 2
Baby chick class = 3
None class = 0
To represent a dog vector y = [0 1 0 0]
To represent a cat vector y = [0 0 1 0]
To represent a baby chick vector y = [0 0 0 1]
To represent a none vector y = [1 0 0 0]
Notations:
C = no. of classes
t = e^(Z[L]) # shape(C, m)
A[L] = e^(Z[L]) / sum(t) # shape(C, m); sum(t) - sum of t's for each example (shape (1, m))
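Those two formulas can be sketched in NumPy as follows (subtracting the per-column max is a common numerical-stability trick that doesn't change the result; the example scores are made-up):

```python
import numpy as np

def softmax(Z):
    # Z: (C, m) scores, one column per example.
    t = np.exp(Z - Z.max(axis=0, keepdims=True))   # stability trick
    return t / t.sum(axis=0, keepdims=True)        # each column sums to 1

# Made-up scores for C = 4 classes, m = 2 examples
Z = np.array([[5.0, 1.0],
              [2.0, 1.0],
              [-1.0, 1.0],
              [3.0, 1.0]])
A = softmax(Z)
```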
Training a Softmax classifier
There's an activation called hard max, which outputs 1 for the maximum value and zeros for
the others.
In NumPy, that's np.max over the vertical axis.
The Softmax name comes from softening the values rather than hardening them like hard max.
Softmax is a generalization of logistic activation function to C classes. If C = 2 softmax reduces
to logistic regression.
The loss function used with softmax:
dZ[L] = Y_hat - Y
Y_hat * (1 - Y_hat)
Example:
TensorFlow
In this section we will learn the basic structure of TensorFlow programs.
Lets see how to implement a minimization function:
Code v.1:
import numpy as np
import tensorflow as tf
for i in range(1000):
session.run(train)
Code v.2 (we feed the inputs to the algorithm through coefficients):
import numpy as np
import tensorflow as tf
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
session.run(w) # Runs the definition of w, if you print this it will print zero
session.run(train, feed_dict={x: coefficients})
for i in range(1000):
session.run(train, feed_dict={x: coefficients})
In TensorFlow you implement only the forward propagation and TensorFlow will do the
backpropagation by itself.
In TensorFlow a placeholder is a variable you can assign a value to later.
If you are using a mini-batch training you should change the feed_dict={x: coefficients} to
the current mini-batch data.
Almost all TensorFlow programs use this:
with tf.Session() as session: # better for cleaning up in case of error/exception
session.run(init)
session.run(w)
In deep learning frameworks there are a lot of things that you can do with one line of code like
changing the optimizer. Side notes:
Writing and running programs in TensorFlow has the following steps:
i. Create Tensors (variables) that are not yet executed/evaluated.
ii. Write operations between those Tensors.
iii. Initialize your Tensors.
iv. Create a Session.
v. Run the Session. This will run the operations you'd written above.
Instead of needing to write code to compute the cost function we know, we can use this line in
TensorFlow : tf.nn.sigmoid_cross_entropy_with_logits(logits = ..., labels = ...)
To initialize weights in a NN using TensorFlow, use tf.get_variable with an initializer, e.g. tf.contrib.layers.xavier_initializer() for weights and tf.zeros_initializer() for biases.
For 3-layer NN, it is important to note that the forward propagation stops at Z3 . The reason is
that in TensorFlow the last linear layer output is given as input to the function computing the
loss. Therefore, you don't need A3 !
To reset the graph use tf.reset_default_graph()
Extra Notes
If you want good papers in deep learning, look at the ICLR proceedings (or the NIPS proceedings) and that will give you a really good view of the field.
Who is Yuanqing Lin?
Head of Baidu research.
First one to win ImageNet
Works in PaddlePaddle deep learning platform.
These Notes were made by Mahmoud Badry @2017
Structuring Machine Learning Projects
Table of contents
Course summary
ML Strategy 1
Why ML Strategy
Orthogonalization
Single number evaluation metric
Satisfying and Optimizing metric
Train/dev/test distributions
Size of the dev and test sets
When to change dev/test sets and metrics
Why human-level performance?
Avoidable bias
Understanding human-level performance
Surpassing human-level performance
Improving your model performance
ML Strategy 2
Carrying out error analysis
Cleaning up incorrectly labeled data
Build your first system quickly, then iterate
Training and testing on different distributions
Bias and Variance with mismatched data distributions
Addressing data mismatch
Transfer learning
Multi-task learning
What is end-to-end deep learning?
Whether to use end-to-end deep learning
Course summary
Here is the course summary as it's given on the course link:
You will learn how to build a successful machine learning project. If you aspire to be a technical
leader in AI, and know how to set direction for your team's work, this course will show you how.
Much of this content has never been taught elsewhere, and is drawn from my experience
building and shipping many deep learning products. This course also has two "flight simulators"
that let you practice decision-making as a machine learning project leader. This provides
"industry experience" that you might otherwise get only after years of ML work experience.
I've seen teams waste months or years through not understanding the principles taught in this
course. I hope this two week course will save you months of time.
This is a standalone course, and you can take this so long as you have basic machine learning
knowledge. This is the third course in the Deep Learning Specialization.
ML Strategy 1
Why ML Strategy
You have a lot of ideas for how to improve the accuracy of your deep learning system:
Collect more data.
Collect more diverse training set.
Train algorithm longer with gradient descent.
Try different optimization algorithm (e.g. Adam).
Try bigger network.
Try smaller network.
Try dropout.
Add L2 regularization.
Change network architecture (activation functions, # of hidden units, etc.)
This course will give you some strategies to help analyze your problem to go in a direction that
will help you get better results.
Orthogonalization
Some deep learning developers know exactly what hyperparameter to tune in order to try to
achieve one effect. This is a process we call orthogonalization.
In orthogonalization, you have some controls, but each control does a specific task and doesn't
affect other controls.
For a supervised learning system to do well, you usually need to tune the knobs of your system
to make sure that four things hold true - chain of assumptions in machine learning:
i. You'll have to fit the training set well on the cost function (near human-level performance if
possible).
If it's not achieved you could try a bigger network, another optimization algorithm (like
Adam)...
ii. Fit the dev set well on the cost function.
If it's not achieved you could try regularization, a bigger training set...
iii. Fit the test set well on the cost function.
If it's not achieved you could try a bigger dev set...
iv. Performs well in the real world.
If it's not achieved you could try changing the dev set or the cost function...
Single number evaluation metric
Suppose we run the classifier on 10 images, 5 cats and 5 non-cats. The classifier predicts that
there are 4 cats, but 1 of those predictions is wrong.
Confusion matrix:
                  Predicted cat   Predicted non-cat
Actual cat             3                  2
Actual non-cat         1                  4
Precision: percentage of true cats among all cat predictions: P = 3/(3 + 1)
Recall: percentage of actual cats that were recognized: R = 3/(3 + 2)
Accuracy: (3 + 4)/10
Using precision/recall for evaluation is good in a lot of cases, but separately they don't tell you
which algorithm is better. Ex:
Classifier   Precision   Recall
A            95%         90%
B            98%         85%
A better thing is to combine precision and recall into one single (real) number evaluation metric.
There is a metric called F1 score, which combines them.
You can think of the F1 score as a (harmonic) average of precision and recall: F1 = 2 / ((1/P) + (1/R))
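A minimal sketch of the F1 computation, using the A/B numbers from the table above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: F1 = 2 / ((1/P) + (1/R))."""
    return 2 * precision * recall / (precision + recall)

# The two classifiers from the table above
f1_a = f1_score(0.95, 0.90)
f1_b = f1_score(0.98, 0.85)
# f1_a is higher, so A wins on this single metric
```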
Sometimes the metrics don't combine naturally. Ex - accuracy and running time:
Classifier   Accuracy   Running time
A            90%        80 ms
B            92%        95 ms
C            92%        1,500 ms
So we can solve that by choosing a single optimizing metric (e.g. accuracy) and treating the other
metrics as satisficing (e.g. running time ≤ 100 ms). In that case B would be the best choice.
So as a general rule: maximize 1 optimizing metric, subject to N-1 satisficing metrics.
When to change dev/test sets and metrics. Example - cat classification:
Algorithm A: 3% error (but a lot of porn images are classified as cat images here)
Algorithm B: 5% error
In this example, if we choose the best algorithm by the metric it would be A, but the
users would decide it is B.
Thus in this case, we want and need to change our metric.
OldMetric = (1/m) * sum(y_pred[i] != y[i], m)
Where m is the number of dev set items.
NewMetric = (1/sum(w[i])) * sum(w[i] * (y_pred[i] != y[i]), m)
where:
w[i] = 1 if x[i] is not porn
w[i] = 10 (or 100) if x[i] is porn
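A NumPy sketch of the weighted metric above; the 10-item labels and the choice of weight 10 are made-up illustration values:

```python
import numpy as np

def weighted_error(y_pred, y, w):
    """NewMetric = (1 / sum(w)) * sum(w[i] * (y_pred[i] != y[i]))."""
    mistakes = (y_pred != y).astype(float)
    return np.sum(w * mistakes) / np.sum(w)

# Hypothetical dev set of 10 items; suppose item 3 is a porn image
y      = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 1, 1, 0, 0, 1, 0])  # one mistake, on item 3

w = np.ones(10)
w[3] = 10.0                 # weight porn images 10x
err = weighted_error(y_pred, y, w)
# the single mistake on the porn image now dominates the metric
```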
This is actually an example of an orthogonalization where you should take a machine learning
problem and break it into distinct steps:
i. Figure out how to define a metric that captures what you want to do - place the target.
ii. Worry about how to actually do well on this metric - how to aim/shoot accurately at the
target.
Conclusion: if doing well on your metric + dev/test set doesn't correspond to doing well in your
application, change your metric and/or dev/test set.
Avoidable bias
Suppose that the cat classification algorithm gives these results:
                 Scenario A   Scenario B
Humans           1%           7.5%
Training error   8%           8%
Dev error        10%          10%
In the left scenario, because the human-level error is 1% we have to focus on the bias.
In the right scenario, because the human-level error is 7.5% we have to focus on the variance.
Human-level error is used as a proxy (estimate) for Bayes optimal error. Bayes optimal error is
always less (better), but human-level error in most cases is not far from it.
You can't do better than Bayes error unless you are overfitting.
Avoidable bias = Training error - Human (Bayes) error
Variance = Dev error - Training error
ML Strategy 2
Carrying out error analysis
In the cat classification example, suppose you have 10% error on your dev set and you want to
decrease the error.
You discovered that some of the mislabeled data are dog pictures that look like cats. Should
you try to make your cat classifier do better on dogs (this could take some weeks)?
Error analysis approach:
Get 100 mislabeled dev set examples at random.
Count up how many are dogs.
if 5 of 100 are dogs then training your classifier to do better on dogs will decrease your
error up to 9.5% (called ceiling), which can be too little.
if 50 of 100 are dogs then you could decrease your error up to 5%, which is reasonable
and you should work on that.
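The ceiling arithmetic above can be sketched as a small helper:

```python
def error_ceiling(dev_error, n_category, n_sampled):
    """Best-case dev error if all mistakes in one category were fully fixed.
    n_category: mistakes of that category among the n_sampled sampled errors."""
    fraction = n_category / n_sampled
    return dev_error * (1 - fraction)

low  = error_ceiling(0.10, 5, 100)    # 5/100 are dogs -> ceiling 9.5%, too little
high = error_ceiling(0.10, 50, 100)   # 50/100 are dogs -> ceiling 5%, worth it
```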
Based on the last example, error analysis helps you analyze the error before taking an action
that could cost a lot of time for little benefit.
Sometimes, you can evaluate multiple error analysis ideas in parallel and choose the best idea.
Create a spreadsheet to do that and decide, e.g.:
Image   Dog   Great Cats   Blurry   Comments
1 ✓ ✓ Pitbull
2 ✓ ✓ ✓
4 ✓
....
In the last example you will decide to work on great cats or blurry images to improve your
performance.
This quick counting procedure, which you can often do in, at most, small numbers of hours can
really help you make much better prioritization decisions, and understand how promising
different approaches are to work on.
If you want to check for mislabeled data in the dev/test set, you should also run the error
analysis with a Mislabeled column. Ex:
Image   Dog   Great Cats   Blurry   Mislabeled   Comments
1 ✓
2 ✓ ✓
4 ✓
....
Then:
If overall dev set error: 10%
Then errors due to incorrect data: 0.6%
Then errors due to other causes: 9.4%
Then you should focus on the 9.4% error rather than the incorrect data.
Apply the same process to your dev and test sets to make sure they continue to come from
the same distribution.
Consider examining examples your algorithm got right as well as ones it got wrong. (Not
always done if you reached a good accuracy)
Training and testing on different distributions
Train and dev/test data may now come from slightly different distributions.
It's very important for the dev and test sets to come from the same distribution, but it could
be OK for the training set to come from a slightly different distribution.
Addressing data mismatch:
1. Carry out manual error analysis to try to understand the difference between the training and
dev/test sets.
2. Make the training data more similar, or collect more data similar to the dev/test sets.
If your goal is to make the training data more similar to your dev set, one technique you
can use is artificial data synthesis, which can help you create more training data:
Combine some of your training data with something that converts it to the dev/test set
distribution.
Examples:
a. Combine normal audio with car noise to get audio with car noise example.
b. Generate cars using 3D graphics in a car classification example.
Be cautious and bear in mind that you might accidentally be simulating data from only
a tiny subset of the space of all possible examples, because your NN might overfit
to the generated data (like one particular car noise or one particular design of 3D-graphics cars).
Transfer learning
Apply the knowledge you gained in a task A to another task B.
For example, if you have trained a cat classifier with a lot of data, you can use part of the
trained NN to solve an x-ray classification problem.
To do transfer learning, delete the last layer of the NN and its weights, and:
i. Option 1: if you have a small data set - keep all the other weights fixed. Add a
new last layer(s), initialize the new layer's weights, feed the new data to the NN and
learn the new weights.
ii. Option 2: if you have enough data, you can retrain all the weights.
Options 1 and 2 are called fine-tuning, and training on task A is called pretraining.
When transfer learning makes sense:
Task A and B have the same input X (e.g. image, audio).
You have a lot of data for task A, which you're transferring from, and relatively less data for
task B, which you're transferring to.
Low level features from task A could be helpful for learning task B.
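A toy NumPy sketch of Option 1 (freeze the pretrained weights, train only a new last layer). The "pretrained" feature extractor here is just fixed random weights and the dataset is synthetic, purely to illustrate the mechanics:

```python
import numpy as np
rng = np.random.default_rng(0)

# Pretend these are frozen weights from a network pretrained on task A
W_frozen = rng.standard_normal((16, 4))      # hypothetical feature extractor

def features(X):
    return np.maximum(0, W_frozen @ X)       # frozen ReLU layer: never updated

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Small synthetic task-B dataset: 4 features, 50 examples
X = rng.standard_normal((4, 50))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, 50)

# Option 1: learn only the new last layer's weights
w_new = np.zeros((1, 16))
b_new = 0.0
A_feat = features(X)                         # computed once, weights stay frozen
for _ in range(500):
    Y_hat = sigmoid(w_new @ A_feat + b_new)
    dZ = Y_hat - Y
    w_new -= 0.1 * (dZ @ A_feat.T) / 50
    b_new -= 0.1 * dZ.mean()
```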
Multi-task learning
Whereas in transfer learning, you have a sequential process where you learn from task A and
then transfer that to task B. In multi-task learning, you start off simultaneously, trying to have
one neural network do several things at the same time. And then each of these tasks helps
hopefully all of the other tasks.
Example:
You want to build an object recognition system that detects pedestrians, cars, stop signs,
and traffic lights (image has multiple labels).
Then Y shape will be (4,m) because we have 4 classes and each one is a binary one.
Then
Cost = (1/m) * sum(sum(L(y_hat(i)_j, y(i)_j))), i = 1..m, j = 1..4 , where
L = - y(i)_j * log(y_hat(i)_j) - (1 - y(i)_j) * log(1 - y_hat(i)_j)
In the last example you could have trained 4 neural networks separately but if some of the earlier
features in neural network can be shared between these different types of objects, then you find
that training one neural network to do four things results in better performance than training 4
completely separate neural networks to do the four tasks separately.
Multi-task learning will also work if y isn't complete for some labels. For example:
Y = [1 ? 1 ...]
[0 0 1 ...]
[? 1 ? ...]
And in this case it still works well despite the missing labels; only the loss function changes:
Loss = (1/m) * sum(sum(L(y_hat(i)_j, y(i)_j) for all j where y(i)_j != ?))
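A NumPy sketch of this masked loss, assuming the unknown "?" labels are encoded as NaN (an encoding choice of mine, not from the course):

```python
import numpy as np

def multitask_loss(Y_hat, Y):
    """Cross-entropy summed only over labels that are known.
    Unknown labels (the '?' entries) are encoded as np.nan and skipped."""
    mask = ~np.isnan(Y)
    y = np.where(mask, Y, 0.0)               # placeholder value, masked out below
    per_label = -(y * np.log(Y_hat) + (1 - y) * np.log(1 - Y_hat))
    m = Y.shape[1]
    return np.sum(per_label[mask]) / m

Y = np.array([[1.0,    np.nan, 1.0],
              [0.0,    0.0,    1.0],
              [np.nan, 1.0,    np.nan]])
Y_hat = np.full_like(Y, 0.5)                 # an uninformative prediction
loss = multitask_loss(Y_hat, Y)              # NaN entries contribute nothing
```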
What is end-to-end deep learning?
Some systems have multiple stages to implement; an end-to-end deep learning system bypasses
the intermediate stages. Example 1 - speech recognition:
Audio ---> Features --> Phonemes --> Words --> Transcript    # non-end-to-end system
Audio ---------------------------------------> Transcript    # end-to-end deep learning system
End-to-end deep learning gives the data more freedom; it might not use phonemes at all when
training!
To build an end-to-end deep learning system that works well, we need a big dataset (more data
than for a non-end-to-end system). If we have a small dataset the ordinary multi-stage
implementation can work just fine.
Example 2 - face recognition system: the best approach for now is multi-step (detect the face
first, then recognize it), because there is a lot of data available for each subtask.
Example 3 - machine translation system:
English --> Text analysis --> ... --> French    # non-end-to-end system
English ----------------------------> French    # end-to-end deep learning system - better
Here the end-to-end deep learning system works better because we have enough data to build
it.
Example 4 - estimating a child's age from an x-ray picture of a hand:
Image --> Bones --> Age    # non-end-to-end system - best approach for now
Image ------------> Age    # end-to-end system
In this example the non-end-to-end system works better because we don't have enough data to
train an end-to-end system.
Convolutional Neural Networks
Table of contents
Course summary
Foundations of CNNs
Computer vision
Edge detection example
Padding
Strided convolution
Convolutions over volumes
One Layer of a Convolutional Network
A simple convolution network example
Pooling layers
Convolutional neural network example
Why convolutions?
Deep convolutional models: case studies
Why look at case studies?
Classic networks
Residual Networks (ResNets)
Why ResNets work
Network in Network and 1×1 convolutions
Inception network motivation
Inception network (GoogleNet)
Using Open-Source Implementation
Transfer Learning
Data Augmentation
State of Computer Vision
Object detection
Object Localization
Landmark Detection
Object Detection
Convolutional Implementation of Sliding Windows
Bounding Box Predictions
Intersection Over Union
Non-max Suppression
Anchor Boxes
YOLO Algorithm
Region Proposals (R-CNN)
Special applications: Face recognition & Neural style transfer
Face Recognition
What is face recognition?
One Shot Learning
Siamese Network
Triplet Loss
Face Verification and Binary Classification
Neural Style Transfer
What is neural style transfer?
What are deep ConvNets learning?
Cost Function
Content Cost Function
Style Cost Function
1D and 3D Generalizations
Extras
Keras
Course summary
Here is the course summary as given on the course link:
This course will teach you how to build convolutional neural networks and apply it to image
data. Thanks to deep learning, computer vision is working far better than just two years ago, and
this is enabling numerous exciting applications ranging from safe autonomous driving, to
accurate face recognition, to automatic reading of radiology images.
You will:
Understand how to build a convolutional neural network, including recent variations such as
residual networks.
Know how to apply convolutional networks to visual detection and recognition tasks.
Know to use neural style transfer to generate art.
Be able to apply these algorithms to a variety of image, video, and other 2D or 3D data.
Foundations of CNNs
Learn to implement the foundational layers of CNNs (pooling, convolutions) and to stack them
properly in a deep network to solve multi-class image classification problems.
Computer vision
Computer vision is one of the application areas advancing rapidly thanks to deep learning.
Some of the applications of computer vision that use deep learning include:
Self-driving cars.
Face recognition.
Deep learning is also enabling new types of art to be created.
Rapid progress in computer vision is enabling new applications that weren't possible a few years
ago.
Deep learning techniques for computer vision are always evolving, creating new architectures
which can help us in areas other than computer vision.
For example, Andrew Ng took some ideas from computer vision and applied them in speech
recognition.
Examples of computer vision problems include:
Image classification.
Object detection.
Detect objects and localize them.
Neural style transfer.
Changes the style of an image using another image.
One of the challenges of computer vision is that images can be very large, and we want a
fast and accurate algorithm to work with them.
For example, a 1000x1000 color image represents 3 million features/inputs to a fully connected
neural network. If the following hidden layer contains 1000 units, then we would have to learn
weights of shape [1000, 3 million], which is 3 billion parameters in the first layer
alone, and that's very computationally expensive!
One of the solutions is to build the network using convolution layers instead of fully connected
layers.
Early layers of a CNN might detect edges, then the middle layers detect parts of objects, and
the later layers put these parts together to produce an output.
Edge detection example
In an image we can detect vertical edges, horizontal edges, or use a full edge detector.
For example, a 6x6 matrix convolved with a 3x3 filter/kernel gives us a 4x4 matrix.
If you do the convolution operation in TensorFlow you will find the function
tf.nn.conv2d. In Keras you will find the Conv2D layer.
The vertical edge detection filter finds a 3x3 place in an image where there is a bright
region followed by a dark region.
If we apply this filter to a white region followed by a dark region, it finds the edge
in between the two colors as a positive value. But if we apply the same filter to a dark
region followed by a white region it gives us negative values. To solve this we can use the
abs function to make the values positive.
Horizontal edge detection filter:
 1  1  1
 0  0  0
-1 -1 -1
There are a lot of ways we can put numbers inside the horizontal or vertical edge detectors. For
example, here is the vertical Sobel filter (the idea is taking care of the middle row):
1 0 -1
2 0 -2
1 0 -1
Also something called Scharr filter (The idea is taking great care of the middle row):
3 0 -3
10 0 -10
3 0 -3
What we learned in deep learning is that we don't need to hand-craft these numbers; we can
treat them as weights and learn them. The network can learn horizontal, vertical, angled, or any
edge type automatically rather than having them hand-designed.
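The 6x6 example above can be sketched in NumPy with a naive loop implementation of the "valid" cross-correlation that deep learning calls convolution (the bright-left/dark-right test image is my own illustration):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation (what DL frameworks call 'convolution')."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

# 6x6 image: bright left half, dark right half
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]], dtype=float)

out = conv2d(image, vertical)
# the two middle columns of the 4x4 output light up, detecting the vertical edge
```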
Padding
In order to use deep neural networks we really need to use padding.
In the last section we saw that a 6x6 matrix convolved with 3x3 filter/kernel gives us a 4x4
matrix.
As a general rule, an nxn matrix convolved with an fxf filter/kernel gives us an
(n-f+1) x (n-f+1) matrix.
We want to apply the convolution operation multiple times, but if the image shrinks we will lose a lot
of data in the process. Also, the edge pixels are used less than the other pixels in the image.
So the two problems with convolutions are:
Shrinking output.
Throwing away a lot of the information that is at the edges.
To solve these problems we can pad the input image before the convolution, by adding some rows
and columns to it. We call the padding amount P: the number of rows/columns that we
insert at the top, bottom, left and right of the image.
The general rule now, if a matrix nxn is convolved with fxf filter/kernel and padding p give us
n+2p-f+1,n+2p-f+1 matrix.
If n = 6, f = 3, and p = 1 Then the output image will have n+2p-f+1 = 6+2-3+1 = 6 . We maintain
the size of the image.
A "same" convolution is a convolution with enough padding that the output size equals the input size.
It's given by the equation:
P = (f-1) / 2
In computer vision f is usually odd. One of the reasons is that an odd-sized filter has a center position.
Strided convolution
Strided convolution is another piece that is used in CNNs.
When doing the convolution operation, we use S to denote the number of pixels we
jump when moving the filter/kernel. In the previous examples S was 1.
If an nxn matrix is convolved with an fxf filter/kernel, with padding p and stride s, it gives us a
floor((n+2p-f)/s + 1) x floor((n+2p-f)/s + 1) matrix (round down when the division is not an integer).
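The output-size rule can be written as a small helper (integer division gives the round-down):

```python
def conv_output_size(n, f, p=0, s=1):
    """(n + 2p - f) / s + 1, rounded down when it is not an integer."""
    return (n + 2 * p - f) // s + 1

a = conv_output_size(6, 3)            # 4: the 6x6 * 3x3 example
b = conv_output_size(6, 3, p=1)       # 6: a 'same' convolution
c = conv_output_size(7, 3, p=0, s=2)  # 3: with a stride of 2
```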
In math textbooks the conv operation flips the filter before using it. What we are doing here is
technically called cross-correlation, but state-of-the-art deep learning uses it as the conv
operation.
For a "same" convolution with stride, the padding that keeps the output size equal to the input
size is given by the equation:
p = (n*s - n + f - s) / 2
When s = 1 ==> P = (f-1) / 2
Hint: no matter the size of the input, the number of the parameters is same if filter size is same.
That makes it less prone to overfitting.
Hyperparameters
f[l] = filter size
p[l] = padding # Default is zero
s[l] = stride
nc[l] = number of filters
Example network with three conv layers:
layer 1: number of filters = 10
layer 2: number of filters = 20
layer 3: number of filters = 40
In the last example you can see that the image gets smaller after each layer, and that's the
trend now.
Types of layer in a convolutional network:
Convolution. #Conv
Pooling #Pool
Fully connected #FC
Pooling layers
Other than conv layers, CNNs often use pooling layers to reduce the size of the inputs,
speed up computation, and make some of the detected features more robust.
Max pooling example:
This example has f = 2 , s = 2 , and p = 0 hyperparameters
The intuition of max pooling: if a feature is detected anywhere in the filter window, keep a high
number. But the main reason people use pooling is that it works well in practice
and reduces computation.
Max pooling has no parameters to learn.
Example of Max pooling on 3D input:
Input: 4x4x10
Max pooling size = 2 and stride = 2
Output: 2x2x10
Average pooling is taking the averages of the values instead of taking the max values.
Max pooling is used more often than average pooling in practice.
If the stride of pooling equals the filter size, the pooling regions don't overlap and the input shrinks by a factor of f.
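A minimal NumPy sketch of max pooling on the 4x4x10 example above (the arange test input is my own illustration):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling per channel on an (n, n, nc) input; no parameters to learn."""
    n, nc = x.shape[0], x.shape[2]
    out_n = (n - f) // s + 1
    out = np.zeros((out_n, out_n, nc))
    for i in range(out_n):
        for j in range(out_n):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max(axis=(0, 1))
    return out

x = np.arange(4 * 4 * 10, dtype=float).reshape(4, 4, 10)
pooled = max_pool(x)    # shape (2, 2, 10), matching the example above
```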
Hyperparameters summary
f : filter size.
s : stride.
Padding is rarely used here.
Max or average pooling.
There are a lot of hyperparameters. For choosing the value of each, you should follow the guidelines
we will discuss later, or check the literature and take some ideas and numbers from it.
Usually the input size decreases over layers while the number of filters increases.
A CNN usually consists of one or more convolution layers (not just one as in the examples shown)
followed by a pooling layer.
Fully connected layers have the most parameters in the network.
To decide how to use these blocks together, you should look at other working examples first to get
some intuition.
Why convolutions?
Two main advantages of Convs are:
Parameter sharing.
A feature detector (such as a vertical edge detector) that's useful in one part of the
image is probably useful in another part of the image.
Sparsity of connections.
In each layer, each output value depends only on a small number of inputs, which also
helps with translation invariance.
Putting it all together:
Classic networks
In this section we will talk about classic networks which are LeNet-5, AlexNet, and VGG.
LeNet-5
The goal of this model was to identify handwritten digits in a 32x32x1 grayscale image. Here is
the drawing of it:
This model was published in 1998. The last layer wasn't using softmax back then.
It has 60k parameters.
The dimensions of the image decreases as the number of channels increases.
Conv ==> Pool ==> Conv ==> Pool ==> FC ==> FC ==> softmax this type of arrangement is
quite common.
The activation functions used in the paper were sigmoid and tanh. Modern implementations
use ReLU in most cases.
[LeCun et al., 1998. Gradient-based learning applied to document recognition]
AlexNet
Named after Alex Krizhevsky, the first author of the paper. The other authors
include Geoffrey Hinton.
The goal for the model was the ImageNet challenge which classifies images into 1000
classes. Here are the drawing of the model:
Summary:
Conv => Max-pool => Conv => Max-pool => Conv => Conv => Conv => Max-pool => Flatten => FC => FC => Softmax
The original paper used multiple GPUs and Local Response Normalization (LRN).
Multiple GPUs were used because GPUs were not so fast back then.
Researchers later found that Local Response Normalization doesn't help much, so for now
don't bother understanding or implementing it.
This paper convinced the computer vision researchers that deep learning is so important.
[Krizhevsky et al., 2012. ImageNet classification with deep convolutional neural networks]
VGG-16
The focus is on simplicity: all conv layers are 3x3 with stride 1 and same padding, and all
max-pool layers are 2x2 with stride 2. It has about 138 million parameters, most of them in
the fully connected layers.
[Simonyan & Zisserman, 2015. Very deep convolutional networks for large-scale image recognition]
Residual Networks (ResNets)
These networks can go deeper without hurting performance. In normal NNs - plain
networks - the theory tells us that if we go deeper we should get a better solution to our
problem, but because of the vanishing and exploding gradient problems the performance
of the network suffers as it goes deeper. Thanks to residual networks we can go as deep as
we want now.
On the left is the normal NN and on the right are the ResNet. As you can see the
performance of ResNet increases as the network goes deeper.
In some cases going deeper won't affect the performance, and that depends on the problem
at hand.
Some people are trying to train networks with 1000 layers now, which isn't used in practice.
[He et al., 2015. Deep residual networks for image recognition]
X --> Big NN --> a[l] --> Layer1 --> Layer2 --> a[l+2]
Then:
a[l+2] = g(z[l+2] + a[l])
       = g(W[l+2] * a[l+1] + b[l+2] + a[l])
If we are using L2 regularization, for example, W[l+2] can shrink to zero. Let's say that
b[l+2] is zero too. Then (with ReLU):
a[l+2] = g(a[l]) = a[l]
This shows that the identity function is easy for a residual block to learn, and that's why
ResNets can train deeper NNs.
It also means the two layers we added don't hurt the performance of the big NN we started with.
Hint: the dimensions of z[l+2] and a[l] have to be the same in ResNets. If they have
different dimensions, we insert a matrix of parameters Ws (which can be learned or fixed):
a[l+2] = g( z[l+2] + Ws * a[l] )    # Ws makes the dimensions equal
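A tiny NumPy sketch of the identity argument: with W[l+2] = 0 and b[l+2] = 0 (and ReLU), the block outputs a[l] unchanged. The 3-unit shapes and zero weights are my own toy choice:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = g(z[l+2] + a[l]): the shortcut adds a[l] before the activation."""
    a_l1 = relu(W1 @ a_l + b1)
    z_l2 = W2 @ a_l1 + b2
    return relu(z_l2 + a_l)          # skip connection

a_l = np.array([[1.0], [2.0], [3.0]])
# If regularization drives the weights to zero, the block computes
# g(a[l]) = a[l]: the identity, so the added layers can't hurt.
W1 = np.zeros((3, 3)); b1 = np.zeros((3, 1))
W2 = np.zeros((3, 3)); b2 = np.zeros((3, 1))
out = residual_block(a_l, W1, b1, W2, b2)   # returns a[l] unchanged
```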
Using a skip-connection helps the gradient to backpropagate and thus helps you to train deeper
networks
Identity block:
Hint: each conv layer is followed by batch norm (BN) before the ReLU. Dimensions here stay the same.
This skip connection is over 2 layers; a skip connection can jump over n layers where n > 2.
This drawing represents Keras layers.
The convolutional block:
The conv can be bottleneck 1 x 1 conv
Network in Network and 1 x 1 convolutions
The 1 x 1 convolution has been used in a lot of modern CNN implementations like the ResNet and
Inception models.
We use it to shrink the number of channels. We also call this feature transformation.
In the second example discussed above we shrank the input from 32 channels to 5
channels.
We will later see that by shrinking the channels we can save a lot of computation.
If we set the number of 1 x 1 conv filters equal to the input number of channels, the output
will contain the same number of channels. Then the 1 x 1 conv acts as a non-linearity,
learning a non-linear transformation.
Replace fully connected layers with 1 x 1 convolutions as Yann LeCun believes they are the same.
In Convolutional Nets, there is no such thing as "fully-connected layers". There are only
convolution layers with 1x1 convolution kernels and a full connection table. Yann
LeCun
[Lin et al., 2013. Network in network]
Transfer Learning
If you are using a specific NN architecture that has been trained before, you can use its
pretrained parameters/weights instead of random initialization to solve your problem.
It can help you boost the performance of your NN.
Pretrained models might have been trained on large datasets like ImageNet, MS COCO, or
PASCAL, and taken a lot of time to learn those parameters/weights with optimized hyperparameters.
This can save you a lot of time.
Let's see an example:
Let's say you have a cat classification problem with 3 classes: Tigger, Misty and
neither.
You don't have a lot of data to train a NN on these images.
Andrew recommends going online and downloading a good NN with its weights, removing the
softmax activation layer, putting your own one there, and making the network learn only the new
layer while the other layers' weights are fixed/frozen.
Frameworks have options to make the parameters frozen in some layers using trainable =
0 or freeze = 0
One trick that can speed up your training is to run the pretrained NN without its final
softmax layer, get the intermediate representation of your images, and save it to disk.
Then use these representations to train a shallow NN. This saves you the time
needed to run each image through all the layers.
It's like converting your images into vectors.
Another example:
What if, unlike the last example, you have a lot of pictures of your cats?
One thing you can do is freeze a few layers at the beginning of the pretrained network
and learn the other weights in the network.
Another idea is to throw away the layers that aren't frozen and put your own layers
there.
Another example:
If you have enough data, you can fine-tune all the layers of your pretrained network, but
don't randomly initialize the parameters; leave the learned parameters as they are and learn from
there.
Data Augmentation
With more data, your deep NN will perform better. Data augmentation is one of the
techniques deep learning uses to increase the performance of deep NNs.
The majority of computer vision applications need more data right now.
Some data augmentation methods that are used for computer vision tasks includes:
Mirroring.
Random cropping.
The issue with this technique is that you might take a wrong crop.
The solution is to make your crops big enough.
Rotation.
Shearing.
Local warping.
Color shifting.
For example, we add some distortion to R, G, and B that makes the image look the
same to a human but different to the computer.
In practice the added values are drawn from some probability distribution, and the
shifts are small.
This makes your algorithm more robust to color changes in images.
There is an algorithm called PCA color augmentation that decides the needed shifts
automatically.
Implementing distortions during training:
You can use a different CPU thread to create distorted mini-batches while you are
training your NN.
Data augmentation also has some hyperparameters. A good place to start is to find an open-source
data augmentation implementation and then use it or fine-tune its hyperparameters.
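A NumPy sketch of three of the methods above - mirroring, random cropping, and color shifting (the 32x32 image, crop size 24, and shift scale 10 are made-up illustration values):

```python
import numpy as np
rng = np.random.default_rng(0)

def mirror(img):
    return img[:, ::-1]                      # horizontal flip

def random_crop(img, size):
    """Crop a size x size patch at a random position (make crops big enough)."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top+size, left:left+size]

def color_shift(img, scale=10.0):
    """Add a small random distortion to each of the R, G, B channels."""
    shift = rng.normal(0, scale, size=3)
    return np.clip(img + shift, 0, 255)

img = rng.integers(0, 256, size=(32, 32, 3)).astype(float)
augmented = color_shift(random_crop(mirror(img), 24))   # shape (24, 24, 3)
```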
Object detection
Learn how to apply your knowledge of CNNs to one of the toughest but hottest fields of
computer vision: object detection.
Object Localization
Object detection is one of the areas of deep learning that has advanced greatly in the past two years.
Object detection:
Given an image we want to detect all the object in the image that belong to a specific
classes and give their location. An image can contain more than one object with
different classes.
Semantic segmentation:
We want to label each pixel in the image with a category label. Semantic segmentation
doesn't differentiate instances; it only cares about pixels (it detects no objects, just pixels).
If two objects of the same class intersect, we won't be able to separate
them.
Instance segmentation:
This is the full problem. Rather than predicting just a bounding box, we want
to know the label of each pixel and also distinguish between instances of the same class.
For image classification we use a conv net with a softmax attached to the end of it.
For classification with localization we use a conv net with a softmax attached to the end of
it, plus four numbers bx, by, bh, and bw that tell you the location of the object in the image.
The dataset should contain these four numbers along with the class.
Y = [
Pc # Probability of an object is presented
bx # Bounding box
by # Bounding box
bh # Bounding box
bw # Bounding box
c1 # The classes
c2
...
]
Y = [
1 # Object is present
0
0
100
100
0
1
0
]
Y = [
0 # Object isn't presented
? # ? means we don't care about the other values
?
?
?
?
?
?
]
The loss function for the Y we have created (Example of the square error):
L(y',y) = {
(y1'-y1)^2 + (y2'-y2)^2 + ... if y1 = 1
(y1'-y1)^2 if y1 = 0
}
In practice we use logistic regression loss for pc, log-likelihood loss for the classes, and squared
error for the bounding box.
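A sketch of the piecewise squared-error loss above (the y vectors are made-up illustration values; in practice the mixed losses just described would be used instead):

```python
import numpy as np

def localization_loss(y_hat, y):
    """Squared-error version of the loss above: all 8 components when an
    object is present (y[0] == 1), only the pc term when it is not."""
    if y[0] == 1:
        return float(np.sum((y_hat - y) ** 2))
    return float((y_hat[0] - y[0]) ** 2)

# [pc, bx, by, bh, bw, c1, c2, c3] with hypothetical values
y     = np.array([1.0, 0.5, 0.5, 0.4, 0.4, 0.0, 1.0, 0.0])
y_hat = np.array([0.9, 0.5, 0.5, 0.4, 0.4, 0.1, 0.8, 0.1])
loss_present = localization_loss(y_hat, y)

y_none = np.zeros(8)                     # no object: the '?' entries don't matter
loss_absent = localization_loss(y_hat, y_none)   # only the pc term counts
```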
Landmark Detection
In some computer vision problems you will need to output some points; that is called
landmark detection.
For example, if you are working on a face recognition problem you might want some points on
the face, like the corners of the eyes, the corners of the mouth, the corners of the nose, and so on. This
can help in a lot of applications, like detecting the pose of the face.
Y shape for the face recognition problem that needs to output 64 landmarks:
Y = [
ThereIsAFace # Probability that a face is present: 0 or 1
l1x,
l1y,
....,
l64x,
l64y
]
Another application is getting the skeleton of a person using different
landmarks/points on the person, which helps in some applications.
Hint: in your labeled data, if l1x,l1y is the left corner of the left eye, then l1x,l1y in all the other
examples has to refer to the same landmark.
Object Detection
We will use a Conv net to solve the object detection problem using a technique called the sliding
windows detection algorithm.
For example, let's say we are working on car detection.
First, we train a Conv net on cropped car images and non-car images.
After we finish training this Conv net, we use it with the sliding windows technique.
Sliding windows detection algorithm:
i. Decide on a rectangle size.
ii. Split your image into rectangles of the size you picked. Every region should be covered. You
can use some stride.
iii. For each rectangle, feed it into the Conv net and decide whether it contains a car.
iv. Pick larger/smaller rectangles and repeat steps ii and iii.
v. Store the rectangles that contain cars.
vi. If two or more rectangles intersect, choose the rectangle with the highest confidence.
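The steps above can be sketched in plain NumPy. This is a minimal sketch: `classify` is a stand-in for the trained Conv net, and the window sizes, stride, and threshold are made-up illustration values:

```python
import numpy as np

def sliding_windows(image, classify, window_sizes=(64, 128), stride=32, threshold=0.5):
    """Naive sliding-windows detection following steps i-vi above."""
    h, w = image.shape[:2]
    detections = []
    for size in window_sizes:                        # step iv: several rectangle sizes
        for top in range(0, h - size + 1, stride):   # step ii: cover the image with a stride
            for left in range(0, w - size + 1, stride):
                crop = image[top:top + size, left:left + size]
                p = classify(crop)                   # step iii: run the "Conv net" on the crop
                if p > threshold:                    # step v: keep windows that contain a car
                    detections.append((p, top, left, size))
    # step vi (crude version): sort so the most confident boxes come first
    detections.sort(reverse=True)
    return detections

# toy usage: a fake classifier that fires on bright regions
image = np.zeros((128, 128))
image[32:96, 32:96] = 1.0                            # a bright 64 x 64 "car"
boxes = sliding_windows(image, classify=lambda crop: crop.mean())
# only the 64 x 64 window at (32, 32) fully covers the bright region
```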
The main disadvantage of sliding windows is the computation time.
In the era of machine learning before deep learning, people used hand-crafted linear classifiers
to classify the object and then applied the sliding windows technique. A linear classifier makes the
computation cheap. But in the deep learning era, this approach is very computationally expensive due to the
complexity of the deep learning model.
To solve this problem, we can implement sliding windows with a convolutional approach.
Another idea is to compress your deep learning model.
Convolutional Implementation of Sliding Windows
Turning FC layers into convolutional layers (predicting an image class out of four classes):
As you can see in the above image, we turned the FC layer into a Conv layer using a
convolution whose filter width and height equal the width and height of its
input.
Convolution implementation of sliding windows:
First, let's consider that the Conv net you trained looks like this (no FC layers, all conv layers):
Say now we have a 16 x 16 x 3 image on which we need to apply sliding windows. With the
normal implementation mentioned in the previous section, we would run
this Conv net four times, once for each window position.
The convolutional implementation is as follows:
We simply feed the whole image into the same Conv net we have trained.
The top-left cell of the result ("the blue one") represents the first sliding window of the
normal implementation. The other cells represent the others.
It is more efficient because the four runs now share their computations.
Another example would be:
This example has a total of 16 sliding windows that share their computation.
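The FC-to-conv trick can be checked numerically: a convolution whose filter has the same height and width as its input computes exactly the dot products of a fully connected layer. A minimal NumPy sketch, with a 5 x 5 x 16 input volume and 400 units as assumed example sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 5, 16))                # activation volume feeding the "FC" layer
W_fc = rng.standard_normal((400, 5 * 5 * 16))      # FC weight matrix: 400 units

# FC view: flatten, then matrix-multiply -> vector of 400
fc_out = W_fc @ a.reshape(-1)

# Conv view: 400 filters, each 5 x 5 x 16 (same size as the input),
# "valid" convolution -> a 1 x 1 x 400 volume holding exactly the same numbers
W_conv = W_fc.reshape(400, 5, 5, 16)
conv_out = np.einsum('hwc,fhwc->f', a, W_conv)     # one dot product per filter

assert np.allclose(fc_out, conv_out)
```

Because the two views share weights, sliding the "FC" head over a bigger input is just an ordinary convolution, which is what lets the windows share computation.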
[Sermanet et al., 2014, OverFeat: Integrated recognition, localization and detection using
convolutional networks]
The weakness of the algorithm is that the position of the rectangle won't be very accurate; maybe
none of the rectangles lands exactly on the object you want to recognize.
In the figure, one rectangle is what the algorithm produced and the other is the accurate car rectangle we actually want.
YOLO stands for you only look once and was developed back in 2015.
Yolo Algorithm:
i. Let's say we have a 100 x 100 image.
ii. Place a 3 x 3 grid on the image. For smoother results you should use a finer grid, like 19 x 19 for the 100
x 100 image.
iii. Apply the classification and localization algorithm we discussed in a previous section to each
cell of the grid. bx and by will represent the center point of the object in each cell and
will be relative to the cell, so they range between 0 and 1, while bh and bw will represent
the height and width of the object, which can be greater than 1.0 but are still floating-point
values.
iv. Do everything at once with the convolutional sliding window. If the Y shape per cell is 1 x 8 as we
discussed before, then the output for the 100 x 100 image will be 3 x 3 x 8, which
corresponds to the 9 cell results.
v. Merge the results using the predicted localization midpoints.
We have a problem if we find more than one object in one grid cell.
One of the best advantages that makes the YOLO algorithm popular is that it has great speed
and a fully convolutional implementation.
How is YOLO different from other object detectors? YOLO uses a single CNN for both
classification and localization of objects using bounding boxes.
In the next sections we will see some ideas that can make the YOLO algorithm better.
Non-max Suppression
One of the problems we noted with YOLO is that it can detect an object multiple times.
Non-max suppression is a way to make sure that YOLO detects each object just once.
For example:
Each car has two or more detections with different probabilities. These come from
grid cells that each think they contain the center point of the object.
Non-max suppression algorithm:
i. Let's assume that we are targeting one class as an output class.
ii. The Y shape should be [Pc, bx, by, bh, bw] , where Pc is the probability that the object is present.
iii. Discard all boxes with Pc < 0.6
iv. While there are any remaining boxes:
a. Pick the box with the largest Pc and output it as a prediction.
b. Discard any remaining box with IoU > 0.5 with the box output in the previous step, i.e.
any box with overlap greater than the 0.5 threshold.
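The algorithm above, for a single class, can be sketched in NumPy. The IoU helper and the (x1, y1, x2, y2) box format are illustrative assumptions:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, pc_threshold=0.6, iou_threshold=0.5):
    """Steps iii-iv above: discard low-Pc boxes, then greedily keep the
    highest-Pc box and drop anything that overlaps it too much."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= pc_threshold]
    keep = []
    while order:
        best = order.pop(0)                  # a. pick the box with the largest Pc
        keep.append(int(best))
        order = [i for i in order            # b. drop boxes with IoU > threshold
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# two overlapping car detections plus one separate box
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = np.array([0.9, 0.7, 0.8])
print(non_max_suppression(boxes, scores))  # -> [0, 2]
```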
If there are multiple classes/object types c that you want to detect, you should run Non-max
suppression c times, once for every output class.
Anchor Boxes
In YOLO, a grid cell detects only one object. What if a grid cell needs to detect multiple objects?
Previously, each object in a training image was assigned to the grid cell that contains that object's
midpoint.
With two anchor boxes, each object in a training image is assigned to the grid cell that contains the
object's midpoint and to the anchor box for that grid cell with the highest IoU. You check which
anchor box the object's rectangle is closest to.
Example of data:
YOLO Algorithm
YOLO is a state-of-the-art object detection model that is fast and accurate
Let's sum up and walk through the whole YOLO algorithm with an example.
Suppose we need to do object detection for our autonomous driving system. It needs to identify
three classes:
i. Pedestrian (walks on the ground).
ii. Car.
iii. Motorcycle.
We decided to choose two anchor boxes: a taller one and a wider one.
As we said, in practice five or more anchor boxes are used, hand-made or generated using
k-means.
Our labeled Y shape will be [Ny, HeightOfGrid, WidthOfGrid, 16] , where Ny is number of
instances and each row (of size 16) is as follows:
[Pc, bx, by, bh, bw, c1, c2, c3, Pc, bx, by, bh, bw, c1, c2, c3]
Your dataset could be images with multiple labels and a rectangle for each label; we then
go through the dataset and give Y the shape and values we agreed on.
An example:
We first initialize all of Y to zeros and ?, then for each label and rectangle choose the
grid cell closest to its midpoint, fill in the shape, and then pick the best anchor box based on the IoU,
so that the shape of Y for one image is [HeightOfGrid, WidthOfGrid, 16]
Train a Conv net on the labeled images; you should receive an output of [HeightOfGrid,
WidthOfGrid, 16] in our case.
To make predictions, run the Conv net on an image and run the Non-max suppression algorithm for
each class you have; in our case there are 3 classes.
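As a sketch of the labeling step, here is one hypothetical way to build the [HeightOfGrid, WidthOfGrid, 16] target for one image. The anchor shapes and the shape-IoU rule for picking the best anchor are made-up illustration choices:

```python
import numpy as np

GRID, ANCHORS, NUM_CLASSES = 3, 2, 3           # 3x3 grid, 2 anchors, 3 classes
SLOT = 5 + NUM_CLASSES                         # [Pc, bx, by, bh, bw, c1, c2, c3]

# hypothetical anchor shapes (height, width) as fractions of the image
anchor_shapes = np.array([(0.6, 0.2),          # the taller one
                          (0.2, 0.6)])         # the wider one

def shape_iou(hw, anchor):
    """IoU of two boxes that share the same midpoint (compare shapes only)."""
    inter = min(hw[0], anchor[0]) * min(hw[1], anchor[1])
    return inter / (hw[0] * hw[1] + anchor[0] * anchor[1] - inter)

def encode(labels):
    """labels: list of (class_id, x, y, h, w) with coordinates in [0, 1]."""
    Y = np.zeros((GRID, GRID, ANCHORS * SLOT))
    for cls, x, y, h, w in labels:
        row, col = int(y * GRID), int(x * GRID)        # grid cell holding the midpoint
        best = int(np.argmax([shape_iou((h, w), a) for a in anchor_shapes]))
        s = best * SLOT                                # slot of the best anchor
        Y[row, col, s] = 1.0                           # Pc
        Y[row, col, s + 1:s + 5] = [x * GRID - col, y * GRID - row, h, w]
        Y[row, col, s + 5 + cls] = 1.0                 # one-hot class
    return Y

Y = encode([(1, 0.5, 0.5, 0.2, 0.5)])          # a wide car in the center cell
```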
Summary:
______________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
======================================================================================
input_1 (InputLayer) (None, 608, 608, 3) 0
______________________________________________________________________________________
conv2d_1 (Conv2D) (None, 608, 608, 32) 864 input_1[0][0]
______________________________________________________________________________________
batch_normalization_1 (BatchNorm (None, 608, 608, 32) 128 conv2d_1[0][0]
______________________________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 608, 608, 32) 0 batch_normalization_1[0]
______________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 304, 304, 32) 0 leaky_re_lu_1[0][0
______________________________________________________________________________________
conv2d_2 (Conv2D) (None, 304, 304, 64) 18432 max_pooling2d_1[0]
______________________________________________________________________________________
batch_normalization_2 (BatchNorm (None, 304, 304, 64) 256 conv2d_2[0][0]
______________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 304, 304, 64) 0 batch_normalization_2[0]
______________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 152, 152, 64) 0 leaky_re_lu_2[0][0
______________________________________________________________________________________
conv2d_3 (Conv2D) (None, 152, 152, 128) 73728 max_pooling2d_2[0]
______________________________________________________________________________________
batch_normalization_3 (BatchNorm (None, 152, 152, 128) 512 conv2d_3[0][0]
______________________________________________________________________________________
leaky_re_lu_3 (LeakyReLU) (None, 152, 152, 128) 0 batch_normalization_3[0]
______________________________________________________________________________________
conv2d_4 (Conv2D) (None, 152, 152, 64) 8192 leaky_re_lu_3[0][0
______________________________________________________________________________________
batch_normalization_4 (BatchNorm (None, 152, 152, 64) 256 conv2d_4[0][0]
______________________________________________________________________________________
leaky_re_lu_4 (LeakyReLU) (None, 152, 152, 64) 0 batch_normalization_4[0]
______________________________________________________________________________________
conv2d_5 (Conv2D) (None, 152, 152, 128) 73728 leaky_re_lu_4[0][0
______________________________________________________________________________________
batch_normalization_5 (BatchNorm (None, 152, 152, 128) 512 conv2d_5[0][0]
______________________________________________________________________________________
leaky_re_lu_5 (LeakyReLU) (None, 152, 152, 128) 0 batch_normalization_5[0]
______________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 76, 76, 128) 0 leaky_re_lu_5[0][0
______________________________________________________________________________________
conv2d_6 (Conv2D) (None, 76, 76, 256) 294912 max_pooling2d_3[0]
______________________________________________________________________________________
batch_normalization_6 (BatchNorm (None, 76, 76, 256) 1024 conv2d_6[0][0]
______________________________________________________________________________________
leaky_re_lu_6 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_6[0]
______________________________________________________________________________________
conv2d_7 (Conv2D) (None, 76, 76, 128) 32768 leaky_re_lu_6[0][0
______________________________________________________________________________________
batch_normalization_7 (BatchNorm (None, 76, 76, 128) 512 conv2d_7[0][0]
______________________________________________________________________________________
leaky_re_lu_7 (LeakyReLU) (None, 76, 76, 128) 0 batch_normalization_7[0]
______________________________________________________________________________________
conv2d_8 (Conv2D) (None, 76, 76, 256) 294912 leaky_re_lu_7[0][0
______________________________________________________________________________________
batch_normalization_8 (BatchNorm (None, 76, 76, 256) 1024 conv2d_8[0][0]
______________________________________________________________________________________
leaky_re_lu_8 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_8[0]
______________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 38, 38, 256) 0 leaky_re_lu_8[0][0
______________________________________________________________________________________
conv2d_9 (Conv2D) (None, 38, 38, 512) 1179648 max_pooling2d_4[0]
______________________________________________________________________________________
batch_normalization_9 (BatchNorm (None, 38, 38, 512) 2048 conv2d_9[0][0]
______________________________________________________________________________________
leaky_re_lu_9 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_9[0]
______________________________________________________________________________________
conv2d_10 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_9[0][0
______________________________________________________________________________________
batch_normalization_10 (BatchNor (None, 38, 38, 256) 1024 conv2d_10[0][0]
______________________________________________________________________________________
leaky_re_lu_10 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_10[0]
______________________________________________________________________________________
conv2d_11 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_10[0][0
______________________________________________________________________________________
batch_normalization_11 (BatchNor (None, 38, 38, 512) 2048 conv2d_11[0][0]
______________________________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_11[0]
______________________________________________________________________________________
conv2d_12 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_11[0][0
______________________________________________________________________________________
batch_normalization_12 (BatchNor (None, 38, 38, 256) 1024 conv2d_12[0][0]
______________________________________________________________________________________
leaky_re_lu_12 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_12[0][0
______________________________________________________________________________________
conv2d_13 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_12[0][0
______________________________________________________________________________________
batch_normalization_13 (BatchNor (None, 38, 38, 512) 2048 conv2d_13[0][0]
______________________________________________________________________________________
leaky_re_lu_13 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_13[0]
______________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D) (None, 19, 19, 512) 0 leaky_re_lu_13[0][0
______________________________________________________________________________________
conv2d_14 (Conv2D) (None, 19, 19, 1024) 4718592 max_pooling2d_5[0]
______________________________________________________________________________________
batch_normalization_14 (BatchNor (None, 19, 19, 1024) 4096 conv2d_14[0][0]
______________________________________________________________________________________
leaky_re_lu_14 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_14[0]
______________________________________________________________________________________
conv2d_15 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_14[0][0
______________________________________________________________________________________
batch_normalization_15 (BatchNor (None, 19, 19, 512) 2048 conv2d_15[0][0]
______________________________________________________________________________________
leaky_re_lu_15 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_15[0]
______________________________________________________________________________________
conv2d_16 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_15[0][0
______________________________________________________________________________________
batch_normalization_16 (BatchNor (None, 19, 19, 1024) 4096 conv2d_16[0][0]
______________________________________________________________________________________
leaky_re_lu_16 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_16[0]
______________________________________________________________________________________
conv2d_17 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_16[0][0
______________________________________________________________________________________
batch_normalization_17 (BatchNor (None, 19, 19, 512) 2048 conv2d_17[0][0]
______________________________________________________________________________________
leaky_re_lu_17 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_17[0]
______________________________________________________________________________________
conv2d_18 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_17[0][0
______________________________________________________________________________________
batch_normalization_18 (BatchNor (None, 19, 19, 1024) 4096 conv2d_18[0][0]
______________________________________________________________________________________
leaky_re_lu_18 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_18[0]
______________________________________________________________________________________
conv2d_19 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_18[0][0
______________________________________________________________________________________
batch_normalization_19 (BatchNor (None, 19, 19, 1024) 4096 conv2d_19[0][0]
______________________________________________________________________________________
conv2d_21 (Conv2D) (None, 38, 38, 64) 32768 leaky_re_lu_13[0][0
______________________________________________________________________________________
leaky_re_lu_19 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_19[0]
______________________________________________________________________________________
batch_normalization_21 (BatchNor (None, 38, 38, 64) 256 conv2d_21[0][0]
______________________________________________________________________________________
conv2d_20 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_19[0][0
______________________________________________________________________________________
leaky_re_lu_21 (LeakyReLU) (None, 38, 38, 64) 0 batch_normalization_21[0]
______________________________________________________________________________________
batch_normalization_20 (BatchNor (None, 19, 19, 1024) 4096 conv2d_20[0][0]
______________________________________________________________________________________
space_to_depth_x2 (Lambda) (None, 19, 19, 256) 0 leaky_re_lu_21[0][0
______________________________________________________________________________________
leaky_re_lu_20 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_20[0]
______________________________________________________________________________________
concatenate_1 (Concatenate) (None, 19, 19, 1280) 0 space_to_depth_x2[0]
leaky_re_lu_20[0][0
______________________________________________________________________________________
conv2d_22 (Conv2D) (None, 19, 19, 1024) 11796480 concatenate_1[0][0
______________________________________________________________________________________
batch_normalization_22 (BatchNor (None, 19, 19, 1024) 4096 conv2d_22[0][0]
______________________________________________________________________________________
leaky_re_lu_22 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_22[0]
______________________________________________________________________________________
conv2d_23 (Conv2D) (None, 19, 19, 425) 435625 leaky_re_lu_22[0][0
======================================================================================
Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672
______________________________________________________________________________________
https://github.com/allanzelener/YAD2K
https://github.com/thtrieu/darkflow
https://pjreddie.com/darknet/yolo/
Our model has several advantages over classifier-based systems. It looks at the whole
image at test time so its predictions are informed by global context in the image. It also
makes predictions with a single network evaluation unlike systems like R-CNN which
require thousands for a single image. This makes it extremely fast, more than 1000x
faster than R-CNN and 100x faster than Fast R-CNN. See our paper for more details on
the full system.
But one of the downsides of YOLO is that it processes a lot of areas where no objects are present.
R-CNN tries to pick a few windows and run a Conv net (your confident classifier) on top of them only.
The algorithm R-CNN uses to pick windows is called a segmentation algorithm. It outputs
something like this:
If, for example, the segmentation algorithm produces 2000 blobs, then we run our
classifier/CNN on top of each of these blobs.
There has been a lot of work on R-CNN to make it faster:
R-CNN:
Propose regions. Classify the proposed regions one at a time. Output a label + bounding box.
The downside is that it's slow.
[Girshick et al., 2013. Rich feature hierarchies for accurate object detection and semantic
segmentation]
Fast R-CNN:
Propose regions. Use convolution implementation of sliding windows to classify all the
proposed regions.
[Girshick, 2015. Fast R-CNN]
Faster R-CNN:
Use convolutional network to propose regions.
[Ren et al., 2016. Faster R-CNN: Towards real-time object detection with region
proposal networks]
Mask R-CNN:
https://arxiv.org/abs/1703.06870
Most implementations of Faster R-CNN are still slower than YOLO.
Andrew Ng thinks that the idea behind YOLO is better than R-CNN because you do
everything in a single pass instead of two.
Other algorithms that use one shot to get the output include SSD and MultiBox.
Face Recognition
A face recognition system identifies a person's face. It can work on both images and videos.
Liveness detection within a video face recognition system prevents the system from being fooled by
a still photo of a face. It can be learned by supervised deep learning, using a dataset of live humans
versus non-live inputs, together with sequence learning.
Face verification vs. face recognition:
Verification:
Input: image, name/ID. (1 : 1)
Output: whether the input image is that of the claimed person.
"is this the claimed person?"
Recognition:
Has a database of K persons
Get an input image
Output ID if the image is any of the K persons (or not recognized)
"who is this person?"
We can use a face verification system to build a face recognition system. The accuracy of the
verification system has to be high (around 99.9% or more) for it to be used accurately within a
recognition system, because the recognition system's accuracy will be lower than the verification
system's, given K persons.
One of the face recognition challenges is to solve one shot learning problem.
One Shot Learning: a recognition system should be able to recognize a person after learning from one image.
Historically, deep learning doesn't work well with such small amounts of data.
To make this work, we instead learn a similarity function:
d( img1, img2 ) = degree of difference between the two images.
We want d to be low for images of the same face.
We use tau T as a threshold for d:
If d( img1, img2 ) <= T then the faces are the same.
The similarity function helps us solve the one-shot learning problem, and it is also robust to new inputs.
Siamese Network
We will implement the similarity function using a type of NN called a Siamese network, in which
we pass two inputs through two (or more) networks with the same architecture and
parameters.
The Siamese network architecture is as follows:
We make 2 identical Conv nets that each encode an input image into a vector. In the above
image the vector shape is (128, )
The distance function is d(x1, x2) = || f(x1) - f(x2) ||^2
If x1 , x2 are the same person, we want d to be low. If they are different persons, we want
d to be high.
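With the encodings in hand, verification reduces to thresholding the distance d. A tiny NumPy sketch, where the random 128-d vectors stand in for real Conv net encodings f(.) and tau = 1.0 is an arbitrary threshold:

```python
import numpy as np

def d(f_x1, f_x2):
    """Squared L2 distance between two 128-d face encodings: ||f(x1) - f(x2)||^2."""
    return float(np.sum((f_x1 - f_x2) ** 2))

# hypothetical 128-d encodings produced by the shared Conv net f(.)
rng = np.random.default_rng(1)
anchor = rng.standard_normal(128)
same   = anchor + 0.01 * rng.standard_normal(128)  # slightly perturbed: same person
other  = rng.standard_normal(128)                  # unrelated encoding

tau = 1.0                                          # verification threshold
print(d(anchor, same) <= tau)    # small distance -> verified as the same person
print(d(anchor, other) <= tau)   # large distance -> rejected
```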
[Taigman et al., 2014. DeepFace: Closing the gap to human-level performance]
Triplet Loss
Triplet Loss is one of the loss functions we can use to solve the similarity distance in a Siamese
network.
Our learning objective with the triplet loss function is to control the distances between an anchor image
and a positive or a negative image.
Positive means the same person, while negative means a different person.
The triplet name comes from comparing an anchor A with a positive P and a negative
N image.
Formally we want:
Positive distance to be less than negative distance
||f(A) - f(P)||^2 <= ||f(A) - f(N)||^2
Then
||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 <= 0
To make sure the NN won't trivially output zeros:
||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 <= -alpha
Alpha is a small number, sometimes called the margin.
Then
||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha <= 0
You need multiple images of the same person in your dataset. Then build triplets out of your
dataset. The dataset should be big enough.
Choosing the triplets A, P, N:
During training, if A, P, N are chosen randomly (subject to A and P being the same person and A and N
being different), then the constraint
d(A, P) + alpha <= d(A, N)
is easily satisfied and the network won't learn much, so you should pick triplets that are hard to train on.
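The objective above is usually trained through its hinge form, max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0). A NumPy sketch of the per-triplet loss, where the encodings are made-up stand-ins for Conv net outputs:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Hinge form of the triplet objective for one (A, P, N) triplet."""
    d_pos = np.sum((f_a - f_p) ** 2)    # ||f(A) - f(P)||^2
    d_neg = np.sum((f_a - f_n) ** 2)    # ||f(A) - f(N)||^2
    return float(max(d_pos - d_neg + alpha, 0.0))

# an "easy" triplet: the negative is already far away, so the loss is zero
f_a = np.zeros(128)
f_p = np.full(128, 0.01)
f_n = np.full(128, 1.0)
print(triplet_loss(f_a, f_p, f_n))            # 0.0 -> nothing to learn here

# a "hard" triplet: the negative is closer than the positive, loss is positive
f_n_hard = np.full(128, 0.005)
print(triplet_loss(f_a, f_p, f_n_hard) > 0)   # True -> a useful training signal
```

This is why random triplets are a poor choice: most of them land in the "easy" case and contribute zero gradient.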
Triplet loss is one way to learn the parameters of a Conv net for face recognition; there is another
way to learn these parameters, as a straight binary classification problem.
Learning the similarity function another way:
In order to implement this you need to look at the features extracted by the Conv net at the
shallower and deeper layers.
It uses a previously trained convolutional network like VGG, and builds on top of that. The idea of
using a network trained on a different task and applying it to a new task is called transfer
learning.
What are deep ConvNets learning?
Pick a unit in layer l. Find the nine image patches that maximize the unit's activation.
Notice that a hidden unit in layer one sees only a relatively small portion of the input image,
so if you plot it, it will match a small image patch in the shallower layers and a larger one
in the deeper layers.
Repeat for other units and layers.
It turns out that layer 1 learns low-level representations like colors and edges,
and each deeper layer learns more and more complex representations.
The first layer's visualization was created using the weights of the first layer. The other images are
generated using the receptive field in the image that maximally triggered the neuron.
[Zeiler and Fergus, 2013. Visualizing and understanding convolutional networks]
A good explanation on how to get receptive field given a layer:
From A guide to receptive field arithmetic for Convolutional Neural Networks
Cost Function
We will define a cost function for the generated image that measures how good it is.
Given a content image C, a style image S, and a generated image G:
J(G) = alpha * J(C,G) + beta * J(S,G)
J(C, G) measures how similar the generated image is to the content image.
J(S, G) measures how similar the generated image is to the style image.
alpha and beta give the relative weighting of the two similarities; they are hyperparameters.
Find the generated image G:
i. Initialize G randomly
For example G: 100 X 100 X 3
ii. Use gradient descent to minimize J(G)
G = G - dJ(G)/dG : we compute the gradient image and use gradient descent to minimize the
cost function.
The iterations might look like the following images:
To Generate this:
You will go through this:
In the previous section we showed that we need a cost function for the content image and one for the
style image, each measuring how similar the generated image is to it.
Say you use hidden layer l to compute the content cost.
If we choose l to be small (like layer 1), we will force the network to produce output very similar to
the original content image.
In practice l is chosen not too shallow and not too deep, but in the middle.
Use a pre-trained ConvNet (e.g., the VGG network).
Let a(C)[l] and a(G)[l] be the activations of layer l on the two images.
If a(C)[l] and a(G)[l] are similar, then the two images have similar content.
J(C, G) at a layer l = 1/2 || a(C)[l] - a(G)[l] ||^2
As it appears, this is just the sum of the squared element-wise differences of the two activation volumes.
To compute the style (Gram) matrix efficiently:
Reshape the activation from H X W X C to HW X C
Name the reshaped activation F.
G[l] = F.T * F , which is a C x C matrix of channel correlations.
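Both costs can be written in a few lines of NumPy. This is a sketch with random activations; real implementations operate on the framework's tensors:

```python
import numpy as np

def content_cost(a_C, a_G):
    """J(C, G) at a layer l: 1/2 * || a(C)[l] - a(G)[l] ||^2."""
    return 0.5 * np.sum((a_C - a_G) ** 2)

def gram_matrix(a):
    """Style (Gram) matrix of an H x W x C activation volume."""
    H, W, C = a.shape
    F = a.reshape(H * W, C)        # reshape H x W x C -> HW x C
    return F.T @ F                 # C x C matrix of channel correlations

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4, 3))
G = gram_matrix(a)
assert G.shape == (3, 3)
assert np.allclose(G, G.T)         # channel correlations are symmetric
```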
Steps to be made if you want to create a tensorflow model for neural style transfer:
i. Create an Interactive Session.
ii. Load the content image.
iii. Load the style image
iv. Randomly initialize the image to be generated
v. Load the VGG16 model
vi. Build the TensorFlow graph:
Run the content image through the VGG16 model and compute the content cost
Run the style image through the VGG16 model and compute the style cost
Compute the total cost
Define the optimizer and the learning rate
vii. Initialize the TensorFlow graph and run it for a large number of iterations, updating the
generated image at every step.
1D and 3D Generalizations
So far we have used the Conv nets for images which are 2D.
Conv nets can work with 1D and 3D data as well.
An example of 1D convolution:
Input shape (14, 1)
Applying 16 filters with F = 5 , S = 1
Output shape will be 10 X 16
Applying 32 filters with F = 5, S = 1
Output shape will be 6 X 32
The general equation (N - F)/S + 1 can be applied here as well, but it gives a length rather than a
2D size.
1D data comes from many sources such as waves, audio, and heartbeat signals.
In most of the applications that use 1D data we use Recurrent Neural Networks (RNNs).
3D data is also available in some applications, like CT scans:
Example of 3D convolution:
Input shape (14, 14,14, 1)
Applying 16 filters with F = 5 , S = 1
Output shape (10, 10, 10, 16)
Applying 32 filters with F = 5, S = 1
Output shape will be (6, 6, 6, 32)
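The output-size formula can be wrapped in a small helper and checked against both examples above. The padding parameter P is an addition not used in the examples (the text's formula assumes P = 0):

```python
def conv_output_size(n, f, s=1, p=0):
    """Output length along one dimension: (N + 2P - F) / S + 1."""
    return (n + 2 * p - f) // s + 1

# 1D example from above: length-14 input, filters of size 5, stride 1
print(conv_output_size(14, 5))   # -> 10  (with 16 filters: output 10 x 16)
print(conv_output_size(10, 5))   # -> 6   (with 32 filters: output 6 x 32)

# 3D example: a 14 x 14 x 14 volume with 5 x 5 x 5 filters, applied per dimension
print(tuple(conv_output_size(14, 5) for _ in range(3)))  # -> (10, 10, 10)
```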
Extras
Keras
Keras is a high-level neural networks API (programming framework), written in Python and
capable of running on top of several lower-level frameworks including TensorFlow, Theano, and
CNTK.
Keras was developed to enable deep learning engineers to build and experiment with different
models very quickly.
Just as TensorFlow is a higher-level framework than Python, Keras is an even higher-level
framework and provides additional abstractions.
Keras will work fine for many common models.
Layers in Keras:
Dense (fully connected) layer.
A linear function followed by a non-linear function.
Convolutional layer.
Pooling layer.
Normalization layer.
E.g. a batch normalization layer.
Flatten layer.
Flattens a matrix into a vector.
Activation layer.
Different activations include: relu, tanh, sigmoid, and softmax.
To train and test a model in Keras there are four steps:
i. Create the model.
ii. Compile the model by calling model.compile(optimizer = "...", loss = "...", metrics =
["accuracy"])
iii. Train the model on train data by calling model.fit(x = ..., y = ..., epochs = ...,
batch_size = ...)
You can add a validation set while training too.
iv. Test the model on test data by calling model.evaluate(x = ..., y = ...)
Summary of the steps in Keras: Create -> Compile -> Fit/Train -> Evaluate/Test
model.summary() gives a lot of useful information about your model, including each layer's
inputs, outputs, and number of parameters.
To choose the Keras backend, go to $HOME/.keras/keras.json and change the backend field to
the desired backend, like Theano or TensorFlow.
After you create the model you can run it in a tensorflow session without compiling, training, and
testing capabilities.
You can save your model with model.save(...) and load it with keras.models.load_model(...) . This will save
your whole trained model to disk, including the trained weights.
Sequence Models
This is the fifth and final course of the deep learning specialization at Coursera which is moderated by
deeplearning.ai. The course is taught by Andrew Ng.
Table of contents
Sequence Models
Table of contents
Course summary
Recurrent Neural Networks
Why sequence models
Notation
Recurrent Neural Network Model
Backpropagation through time
Different types of RNNs
Language model and sequence generation
Sampling novel sequences
Vanishing gradients with RNNs
Gated Recurrent Unit (GRU)
Long Short Term Memory (LSTM)
Bidirectional RNN
Deep RNNs
Back propagation with RNNs
Natural Language Processing & Word Embeddings
Introduction to Word Embeddings
Word Representation
Using word embeddings
Properties of word embeddings
Embedding matrix
Learning Word Embeddings: Word2vec & GloVe
Learning word embeddings
Word2Vec
Negative Sampling
GloVe word vectors
Applications using Word Embeddings
Sentiment Classification
Debiasing word embeddings
Sequence models & Attention mechanism
Various sequence to sequence architectures
Basic Models
Picking the most likely sentence
Beam Search
Refinements to Beam Search
Error analysis in beam search
BLEU Score
Attention Model Intuition
Attention Model
Speech recognition - Audio data
Speech recognition
Trigger Word Detection
Extras
Machine translation attention model (From notebooks)
Course summary
Here is the course summary as it's given on the course link:
This course will teach you how to build models for natural language, audio, and other sequence
data. Thanks to deep learning, sequence algorithms are working far better than just two years
ago, and this is enabling numerous exciting applications in speech recognition, music synthesis,
chatbots, machine translation, natural language understanding, and many others.
You will:
Understand how to build and train Recurrent Neural Networks (RNNs), and commonly-used
variants such as GRUs and LSTMs.
Be able to apply sequence models to natural language problems, including text synthesis.
Be able to apply sequence models to audio applications, including speech recognition and
music synthesis.
This is the fifth and final course of the Deep Learning Specialization.
Notation
In this section we will discuss the notation that we will use throughout the course.
Motivating example:
We will index the first element of x by x<1>, the second x<2> and so on.
x<1> = Harry
x<2> = Potter
Similarly, we will index the first element of y by y<1>, the second y<2> and so on.
y<1> = 1
y<2> = 1
Tx is the size of the input sequence and Ty is the size of the output sequence.
Tx = Ty = 9 in the last example although they can be different in other problems.
x(i)<t> is the t-th element of the input sequence of training example i. Similarly, y(i)<t> is the
t-th element of the output sequence of the i-th training example.
Tx(i) is the input sequence length for training example i; it can differ across examples.
Similarly, Ty(i) is the output sequence length of the i-th training example.
Representing words:
In this course we will now work with NLP, which stands for natural language processing. One
of the challenges of NLP is: how can we represent a word?
i. We need a vocabulary list that contains all the words in our target sets.
Example:
[a ... And ... Harry ... Potter ... Zulu]
Each word will have a unique index that it can be represented with.
The sorting here is in alphabetical order.
Vocabulary sizes in modern applications range from 30,000 to 50,000; 100,000 is not
uncommon, and some of the bigger companies use as many as a million.
To build the vocabulary list, you can read all the texts you have and keep the m most
frequent words, or search online for the m most frequent words.
ii. Create a one-hot encoding sequence for each word in your dataset given the vocabulary
you have created.
While converting, what if we meet a word that's not in your vocabulary?
We can add a token named <UNK> to the vocabulary, which stands for unknown word, and
use its index for your one-hot vector.
Full example:
The goal, given this representation for x, is to learn a mapping to the target output y
using a sequence model, as a supervised learning problem.
Recurrent Neural Network Model
Why not use a standard network for sequence tasks? There are two problems:
Inputs and outputs can have different lengths in different examples.
This can be solved for normal NNs by padding to the maximum length, but it's not a
good solution.
A standard network doesn't share features learned across different positions of the
text/sequence.
Feature sharing, as in CNNs, can significantly reduce the number of parameters in
your model. That's what we will do in RNNs.
A recurrent neural network has neither of the two mentioned problems.
Let's build an RNN that solves the named entity recognition task:
In this problem Tx = Ty. In other problems where they aren't equal, the RNN architecture
may be different.
a<0> is usually initialized with zeros, but some others may initialize it randomly in some
cases.
There are three weight matrices here: Wax, Waa, and Wya with shapes:
Wax: (NoOfHiddenNeurons, nx)
Waa: (NoOfHiddenNeurons, NoOfHiddenNeurons)
Wya: (ny, NoOfHiddenNeurons)
The weight matrix Waa controls the memory the RNN maintains from the previous time steps.
A lot of papers and books draw the same architecture this way:
It's harder to interpret. It's easier to unroll such drawings into the unrolled version.
In the discussed RNN architecture, the current output ŷ<t> depends on the previous inputs and
activations.
Consider the example 'He said, "Teddy Roosevelt was a great president"'. Here Teddy is a
person's name, but we know that from the word president that comes after Teddy, not from
He and said that come before it.
So a limitation of the discussed architecture is that it cannot learn from elements later in the
sequence. To address this problem we will later discuss the Bidirectional RNN (BRNN).
Now let's discuss the forward propagation equations on the discussed architecture:
The activation function for a is usually tanh or ReLU; the one for y depends on your task,
with choices like sigmoid and softmax. In the named entity recognition task we will use
sigmoid because we only have two classes.
To help us develop complex RNN architectures, the last equations need to be simplified
a bit.
Simplified RNN notation:
Where wa, ba, wy, and by are shared across each element in a sequence.
We will use the cross-entropy loss function:
Where the first equation is the loss for one example, and the loss for the whole sequence is
given by the sum over all the single-example losses.
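The forward step described above can be sketched in NumPy (a minimal sketch following this section's notation; the shapes and random values are toy placeholders):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, Wax, Waa, Wya, ba, by):
    """One time step: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y<t> = softmax(Wya a<t> + by)."""
    a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)
    y_pred = softmax(Wya @ a_next + by)
    return a_next, y_pred

# Toy shapes: n_a hidden units, n_x input features, n_y output classes
n_a, n_x, n_y = 5, 3, 2
rng = np.random.default_rng(0)
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
Wya = rng.standard_normal((n_y, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

a, y = rnn_cell_forward(rng.standard_normal((n_x, 1)), np.zeros((n_a, 1)),
                        Wax, Waa, Wya, ba, by)
```

Running the cell over t = 1..Tx, feeding each a<t> into the next step, gives the full forward propagation.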
Graph with losses:
The backpropagation here is called backpropagation through time because we pass activation
a from one sequence element to another like backwards in time.
Note that starting from the second time step we feed the generated output back into the
network.
There is another interesting Many-to-Many architecture. In applications like machine
translation, the input and output sequences have different lengths in most cases. So an
alternative Many-to-Many architecture that fits translation would be as follows:
There are encoder and decoder parts in this architecture. The encoder encodes the
input sequence into one vector and feeds it to the decoder to generate the outputs. The
encoder and decoder have different weight matrices.
Summary of RNN types:
There is another architecture which is the attention architecture which we will talk about in
chapter 3.
ii. We first pass a<0> = zeros vector, and x<1> = zeros vector.
iii. Then we choose a prediction randomly from the distribution obtained by ŷ<1>. For example it
could be "The".
In NumPy this can be implemented using: numpy.random.choice(...)
This is the line that gives you a random beginning of the sentence each time you
sample a novel sequence.
iv. We pass the last predicted word together with the calculated a<1> into the next step.
v. We keep repeating steps iii and iv for a fixed length or until we get the <EOS> token.
vi. You can reject any <UNK> token if you mind finding it in your output.
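The sampling loop above can be sketched as follows (the model step here is a stand-in that returns a fake distribution over a made-up vocabulary; a real implementation would call the trained RNN):

```python
import numpy as np

vocab = ["<EOS>", "the", "cat", "sat", "<UNK>"]
rng = np.random.default_rng(1)

def fake_model_step(a_prev, x):
    """Stand-in for one step of a trained RNN: returns (y_hat, a_next)."""
    logits = rng.standard_normal(len(vocab))
    y_hat = np.exp(logits) / np.exp(logits).sum()
    return y_hat, a_prev  # a real model would also update the activation

a = np.zeros(4)                      # a<0> = zeros
x = np.zeros(len(vocab))             # x<1> = zeros
sampled = []
for _ in range(20):                  # fixed maximum length
    y_hat, a = fake_model_step(a, x)
    idx = np.random.choice(len(vocab), p=y_hat)  # random pick from the distribution
    if vocab[idx] == "<EOS>":
        break
    sampled.append(vocab[idx])
    x = np.zeros(len(vocab))
    x[idx] = 1.0                     # feed the prediction back as the next input
```
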
So far we have built a word-level language model. It's also possible to implement a
character-level language model.
In the character-level language model, the vocabulary will contain [a-zA-Z0-9], punctuation,
special characters, and possibly a special token.
Character-level language model has some pros and cons compared to the word-level language
model
Pros:
a. There will be no <UNK> token - it can create any word.
Cons:
a. The main disadvantage is that you end up with much longer sequences.
b. Character-level language models are not as good as word-level language models at
capturing long-range dependencies, i.e. how the earlier parts of the sentence
affect the later parts.
c. They are also more computationally expensive and harder to train.
The trend Andrew has seen in NLP is that for the most part, a word-level language model is still
used, but as computers get faster there are more and more applications where people are, at
least in some special cases, starting to look at more character-level models. Also, they are used in
specialized applications where you might need to deal with unknown words or other vocabulary
words a lot. Or they are also used in more specialized applications where you have a more
specialized vocabulary.
An RNN that processes a sequence with 10,000 time steps is effectively a 10,000-layer deep
network, which is very hard to optimize.
Let's take an example. Suppose we are working on a language modeling problem and there are
two sequences the model tries to learn:
What we need to learn here is that "was" goes with "cat" and "were" goes with "cats". The
naive RNN is not very good at capturing very long-term dependencies like this.
As we have discussed in Deep neural networks, deeper networks are getting into the vanishing
gradient problem. That also happens with RNNs with a long sequence size.
To compute the word "was", we need to compute gradients through everything before it.
Multiplying many fractions tends to make the gradient vanish, while multiplying many
large numbers tends to make it explode.
Therefore some of your weights may not be updated properly.
In the problem we described, this means it's hard for the network to carry the memory of
"cat" all the way to "was". So the network won't identify singular/plural and produce the
right grammatical form of the verb, was/were.
Vanishing gradients problem tends to be the bigger problem with RNNs than the exploding
gradients problem. We will discuss how to solve it in next sections.
Exploding gradients can be easily detected when your weight values become NaN. One way
to solve the exploding gradient problem is to apply gradient clipping: if your gradient exceeds
some threshold, re-scale some of your gradient vectors so that they are not too big, i.e. clip
them according to some maximum value.
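A minimal element-wise clipping sketch in NumPy (norm-based rescaling is another common variant; the gradient values below are made up):

```python
import numpy as np

def clip_gradients(grads, max_value):
    """Clip each gradient array element-wise into [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in grads.items()}

grads = {
    "dWaa": np.array([[100.0, -0.5], [3.0, -42.0]]),  # an "exploded" gradient
    "dba": np.array([[0.1], [-7.0]]),
}
clipped = clip_gradients(grads, 5.0)  # values inside the range are left untouched
```
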
Extra:
Each layer in a GRU has a new variable C, the memory cell, which can decide whether or not
to memorize something.
Let's take the cat sentence example and apply it to understand these equations:
We will suppose the update gate U is 0 or 1 and is a bit that tells us whether the singular
word needs to be memorized.
Word      U    C (memory cell)
The       0    val
cat       1    new_val (memorize: the subject is singular)
which     0    new_val (kept)
already   0    new_val (kept)
...       0    new_val (kept)
full      ..   ..
Because the update gate U is usually a very small number like 0.00001, C<t> stays close to
C<t-1>, so the memory is preserved over many steps and GRUs don't suffer from the
vanishing gradient problem.
Shapes:
What has been described so far is the simplified GRU unit. Let's now describe the full one:
The full GRU contains a new gate r that is used when calculating the candidate C~<t>. The
gate tells you how relevant C<t-1> is to C<t>.
Equations:
So why do we use these architectures? Why don't we change them? How do we know they will
work? Why not add another gate, or use the simpler GRU instead of the full one? Well,
researchers have experimented over the years with many different versions of these
architectures, also addressing the vanishing gradient problem, and have found that full GRUs
are among the best RNN architectures for many different problems. You can make your own
design, but keep in mind that GRUs and LSTMs are the standards.
In a GRU we have an update gate U, a relevance gate r, and a candidate cell variable C~<t>,
while in an LSTM we have an update gate U (sometimes called the input gate I), a forget gate
F, an output gate O, and a candidate cell variable C~<t>.
Drawings (inspired by http://colah.github.io/posts/2015-08-Understanding-LSTMs/):
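The full GRU step can be sketched in NumPy (a toy sketch; the weight shapes follow the common convention of stacking [c<t-1>, x<t>], and all sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(xt, c_prev, Wu, Wr, Wc, bu, br, bc):
    """Full GRU step (a<t> = c<t>)."""
    concat = np.vstack([c_prev, xt])
    u = sigmoid(Wu @ concat + bu)                             # update gate U
    r = sigmoid(Wr @ concat + br)                             # relevance gate r
    c_cand = np.tanh(Wc @ np.vstack([r * c_prev, xt]) + bc)   # candidate C~<t>
    c_next = u * c_cand + (1.0 - u) * c_prev                  # memory cell C<t>
    return c_next

n_c, n_x = 4, 3
rng = np.random.default_rng(2)
Wu = rng.standard_normal((n_c, n_c + n_x))
Wr = rng.standard_normal((n_c, n_c + n_x))
Wc = rng.standard_normal((n_c, n_c + n_x))
bu = br = bc = np.zeros((n_c, 1))

c = gru_cell_forward(rng.standard_normal((n_x, 1)), np.zeros((n_c, 1)),
                     Wu, Wr, Wc, bu, br, bc)
```

Note how, when u is close to 0, c_next stays close to c_prev, which is exactly why the memory survives long spans.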
Bidirectional RNN
There are still some ideas to let you build much more powerful sequence models. One of them is
bidirectional RNNs and another is Deep RNNs.
As we saw before, here is an example of the Name entity recognition task:
The name Teddy cannot be inferred from He and said, but can be inferred from bears.
BRNNs fix this issue.
Here is BRNNs architecture:
Deep RNNs
In a lot of cases a standard one-layer RNN will solve your problem. But in some problems
it's useful to stack RNN layers to make a deeper network.
For example, a deep RNN with 3 layers would look like this:
In feed-forward deep nets, there could be 100 or even 200 layers. In deep RNNs stacking 3 layers
is already considered deep and expensive to train.
In some cases you might see some feed-forward network layers connected after the recurrent cells.
Back propagation with RNNs
In modern deep learning frameworks, you only have to implement the forward pass, and
the framework takes care of the backward pass, so most deep learning engineers do not
need to bother with the details of the backward pass. If however you are an expert in
calculus and want to see the details of backprop in RNNs, you can work through this
optional portion of the notebook.
The quote is taken from this notebook. If you want the details of the back propagation with
programming notes look at the linked notebook.
Word Representation
NLP has been revolutionized by deep learning and especially by RNNs and deep RNNs.
Word embeddings is a way of representing words. It lets your algorithm automatically
understand the analogies between words like "king" and "queen".
So far we have defined our language by a vocabulary. Then represented our words with a one-
hot vector that represents the word in the vocabulary.
An image example would be:
We will use the notation O_idx for any word that is represented with a one-hot vector, as in
the image.
One of the weaknesses of this representation is that it treats each word as a thing in itself,
and it doesn't allow an algorithm to generalize across words.
For example: "I want a glass of orange ______", a model should predict the next word as
juice.
A similar example: "I want a glass of apple ______". A model won't easily predict juice
here if it wasn't trained on it, and under this representation the two examples aren't
related, although orange and apple are similar.
The inner product between any two different one-hot vectors is zero. Also, the distances
between them are all the same.
So, instead of a one-hot representation, wouldn't it be nice if we could learn a featurized
representation for each of these words: man, woman, king, queen, apple, and orange?
Each word will have, for example, 300 features, each a floating-point number.
Each word column will be a 300-dimensional vector, which will be its representation.
We will use the notation e5391 to describe the feature vector of the word man (index 5391).
Now, if we return to the examples we described again:
"I want a glass of orange ______"
I want a glass of apple ______
Orange and apple now share a lot of similar features which makes it easier for an algorithm
to generalize between them.
We call this representation Word embeddings.
To visualize word embeddings we use a t-SNE algorithm to reduce the features to 2 dimensions
which makes it easy to visualize:
You will get a sense that more related words are closer to each other.
The name word embeddings comes from the idea of embedding each word's unique vector
into an n-dimensional space.
Let's see how we can take the feature representation we have extracted from each word and
apply it in the Named entity recognition problem.
Given this example (from named entity recognition):
This is similar to face encoding, where we encode each face into a vector and then check how
similar these vectors are.
The words encoding and embedding have a similar meaning here.
In the word embeddings task, we are learning a representation for each word in our vocabulary
(unlike in image encoding where we have to map each new image to some n-dimensional
vector). We will discuss the algorithm in next sections.
One of the most fascinating properties of word embeddings is that they can also help with
analogy reasoning. While analogy reasoning may not be the most important NLP application
by itself, it helps convey a sense of what these word embeddings can do.
Analogies example:
Given this word embeddings table:
It turns out that eQueen is the best solution here, i.e. the most similar vector.
Cosine similarity - the most commonly used similarity function:
Equation:
The numerator is the inner product of the u and v vectors. It will be large if the
vectors are very similar.
You can also use Euclidean distance as a similarity function (but it rather measures a dissimilarity,
so you should take it with negative sign).
We can use this equation to calculate the similarities between word embeddings and on the
analogy problem where u = ew and v = eking - eman + ewoman
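A sketch of the cosine similarity and the analogy check (the 4-dimensional embeddings below are made up purely for illustration; real embeddings would be learned, e.g. 300-dimensional):

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up embeddings, just to illustrate the analogy test
e_king  = np.array([0.9, 0.8, 0.1, 0.0])
e_man   = np.array([0.1, 0.9, 0.0, 0.1])
e_woman = np.array([0.1, 0.0, 0.9, 0.1])
e_queen = np.array([0.9, 0.0, 0.9, 0.0])

# u = e_w, v = e_king - e_man + e_woman from the analogy problem
sim = cosine_similarity(e_queen, e_king - e_man + e_woman)
```

In a real setting you would compute this similarity for every word w in the vocabulary and pick the argmax.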
Embedding matrix
When you implement an algorithm to learn a word embedding, what you end up learning is a
embedding matrix.
Let's take an example:
Suppose we are using 10,000 words as our vocabulary (plus the <UNK> token).
The algorithm should create a matrix E of the shape (300, 10000) in case we are extracting
300 features.
If O6257 is the one-hot encoding of the word orange, of shape (10000, 1), then
np.dot(E, O6257) = e6257, whose shape is (300, 1).
Generally, np.dot(E, Oj) = ej
In the next sections, you will see that we first initialize E randomly and then try to learn all the
parameters of this matrix.
In practice it's not efficient to use a dot product when you are trying to extract the
embedding of a specific word; instead, we use slicing to take the specific column. In Keras
there is an Embedding layer that extracts this column with no multiplication.
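A quick NumPy check that the dot product and the column slice agree (tiny illustrative sizes stand in for 300 features and 10,000 words):

```python
import numpy as np

vocab_size, n_features = 10, 4   # tiny stand-ins for 10,000 words and 300 features
rng = np.random.default_rng(3)
E = rng.standard_normal((n_features, vocab_size))  # embedding matrix

j = 6                            # index of some word
O_j = np.zeros((vocab_size, 1))
O_j[j] = 1.0                     # one-hot vector

e_dot = np.dot(E, O_j)           # matrix-vector product, shape (n_features, 1)
e_slice = E[:, [j]]              # column slice: same vector, no multiplication
```
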
Let's start learning some algorithms that can learn word embeddings.
At the start, word embedding algorithms were complex, but then they got simpler and
simpler. We will start with the more complex examples to build intuition.
Neural language model:
Let's start with an example:
We want to build a language model so that we can predict the next word.
So we use this neural network to learn the language model
For example, we have the sentence: "I want a glass of orange juice to go along with my
cereal". Some (context, target) pairs with their relative positions:
Context   Target   Position
orange    juice    +1
orange    glass    -2
orange    my       +6
This is not an easy learning problem, because predicting a word within a window of
-10/+10 words (10 being just an example) is hard.
Word2Vec model:
Vocabulary size = 10,000 words
Let's say that the context word is c and the target word is t
We want to learn a mapping from c to t
We get ec by E . oc
We then use a softmax layer to get P(t|c) which is ŷ
Also we will use the cross-entropy loss function.
This model is called skip-grams model.
The last model has a problem with the softmax layer:
Here we are summing over 10,000 numbers, which is the number of words in our
vocabulary.
If this number is larger, say 1 million, the computation becomes very slow.
One of the solutions for the last problem is to use "Hierarchical softmax classifier" which works
as a tree classifier.
In practice, the hierarchical softmax classifier doesn't use a balanced tree like the drawn one.
Common words are at the top and less common are at the bottom.
How to sample the context c?
One way is to choose the context by random from your corpus.
If you do it that way, frequent words like "the, of, a, and, to, ..." will dominate other
words like "orange, apple, durian, ..."
In practice, we don't sample the context uniformly at random; instead there are heuristics to
balance the common words and the less common words.
The word2vec paper includes two ideas for learning word embeddings: one is the skip-gram
model and the other is CBoW (continuous bag-of-words).
Negative Sampling
Negative sampling allows you to do something similar to the skip-gram model, but with a much
more efficient learning algorithm. We will create a different learning problem.
Context   Word    Target (y)
orange    juice   1
orange    king    0
orange    book    0
orange    the     0
orange    of      0
We get a positive example by using the same skip-gram technique, with a fixed window
around the context word.
We will have a ratio of k negative examples to 1 positive example in the data we are collecting.
Now let's define the model that will learn this supervised learning problem:
Let's say that the context word is c, the word is t, and y is the label.
We will apply a simple logistic regression model.
So it is as if we have 10,000 binary classification problems, but we only train k+1 of the
classifiers in each iteration.
To sample negative examples, we could use the empirical frequencies of words in the corpus,
which means sampling according to how often different words appear. But the problem with
that is that we would mostly sample frequent words like the, of, and so on.
The best is to sample with this equation (according to authors):
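The authors' heuristic raises each word frequency to the 3/4 power before normalizing, which sits between the raw empirical distribution and a uniform one. A sketch with made-up counts:

```python
import numpy as np

# Made-up word frequencies in a toy corpus
freqs = np.array([500.0, 120.0, 30.0, 5.0, 1.0])     # e.g. "the", "of", "orange", "apple", "durian"
p_unigram = freqs / freqs.sum()                      # raw empirical frequencies
p_sampling = freqs ** 0.75 / (freqs ** 0.75).sum()   # 3/4-power heuristic

# Rare words get relatively more sampling probability than under the raw
# distribution, while very frequent words get less.
```
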
GloVe is another algorithm for learning word embeddings. It's the simplest of them.
It is not used as much as word2vec's skip-gram model, but it has some enthusiasts because
of its simplicity.
GloVe stands for Global vectors for word representation.
Let's use our previous example: "I want a glass of orange juice to go along with my cereal".
We will choose a context and a target from the choices we have mentioned in the previous
sections.
Then we will calculate, for every pair: Xct = number of times t appears in the context of c.
Xct = Xtc if we choose a symmetric window, but they will not be equal if, for example, the
context is only the previous words. In GloVe they use a symmetric window, so they are equal.
f(x) - the weighting term, used for many reasons which include:
The log(0) problem, which might occur if there are no pairs for the given target and
context values.
Giving not too much weight for stop words like "is", "the", and "this" which occur many
times.
Giving not too little weight for infrequent words.
Theta and e are symmetric which helps getting the final word embedding.
If this is your first try, you should try to download a pre-trained model that has been made
and actually works best.
If you have enough data, you can try to implement one of the available algorithms.
Because word embeddings are very computationally expensive to train, most ML
practitioners will load a pre-trained set of embeddings.
A final note that you can't guarantee that the axis used to represent the features will be
well-aligned with what might be easily humanly interpretable axis like gender, royal, age.
As we have discussed before, sentiment classification is the task of determining whether a
text expresses a positive or a negative opinion. It's very useful in NLP and is used in many
applications. An example would be:
One of the challenges with it is that you might not have a huge labeled training set for it,
but using word embeddings can help make up for that.
Common dataset sizes vary from 10,000 to 100,000 words.
A simple sentiment classification model would be like this:
The embedding matrix may have been trained on say 100 billion words.
Number of features in word embedding is 300.
We can sum or average the word vectors, then pass the result to a softmax classifier. That
makes this classifier work for short or long sentences.
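The averaging step can be sketched as follows (the embedding matrix and word indices are made-up placeholders; a real model would use pre-trained embeddings):

```python
import numpy as np

def sentence_features(word_indices, E):
    """Average the embedding columns of a sentence's words.
    The output size is fixed regardless of sentence length."""
    return E[:, word_indices].mean(axis=1)

rng = np.random.default_rng(4)
E = rng.standard_normal((300, 50))   # stand-in for a pre-trained embedding matrix (50-word vocab)

short_feat = sentence_features([3, 7], E)               # 2-word sentence
long_feat = sentence_features([3, 7, 11, 2, 40, 9], E)  # 6-word sentence
# Both are (300,) vectors that can be fed to the same softmax classifier.
```
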
One of the problems with this simple model is that it ignores word order. For example,
"Completely lacking in good taste, good service, and good ambience" has the word good 3
times but is a negative review.
A better model uses an RNN for solving this problem:
And so if you train this algorithm, you end up with a pretty decent sentiment classification
algorithm.
Also, it will generalize better even to words that weren't in your dataset. For example, take
the sentence "Completely absent of good taste, good service, and good ambience". Even if
the word "absent" is not in your labeled training set, if it was in the 1-billion- or
100-billion-word corpus used to train the word embeddings, the model might still get this
right and generalize much better, even to words that appear in the embedding training
corpus but not in the labeled training set you had for the sentiment classification problem.
We want to make sure that our word embeddings are free from undesirable forms of bias, such
as gender bias, ethnicity bias and so on.
Horrifying results on the trained word embeddings in the context of Analogies:
Man : Computer_programmer as Woman : Homemaker
Father : Doctor as Mother : Nurse
Word embeddings can reflect gender, ethnicity, age, sexual orientation, and other biases of text
used to train the model.
Learning algorithms in general are making important decisions, and they mustn't be biased.
Andrew thinks we actually have better ideas for quickly reducing bias in AI than for quickly
reducing bias in the human race, although a lot of work still needs to be done.
Addressing bias in word embeddings steps:
Idea from the paper: https://arxiv.org/abs/1607.06520
Given these learned embeddings:
We need to solve the gender bias here. The steps we will discuss can help solve any bias
problem but we are focusing here on gender bias.
Here are the steps:
a. Identify the direction:
Calculate the difference between:
ehe - eshe
emale - efemale
....
Choose some k differences and average them.
This will help you find this:
By that we have found the bias direction, which is a 1D vector, and the non-bias
directions, which form a 299D subspace.
b. Neutralize: for every word that is not definitional, project it to get rid of the bias.
Babysitter and doctor need to be neutral, so we project them onto the non-bias axis,
removing the component along the bias direction:
After that they will be equal in terms of gender.
To do this, the authors of the paper trained a classifier to tell which words need
to be neutralized.
c. Equalize pairs
We want each pair to have difference only in gender. Like:
Grandfather - Grandmother
He - She
Boy - Girl
We want to do this because otherwise the distance between grandfather and babysitter
would be bigger than the distance between grandmother and babysitter:
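The neutralize step (b) is a projection onto the bias direction followed by a subtraction. A minimal sketch with a random embedding and bias direction (both made up; a real bias direction would come from averaged differences like e_he - e_she):

```python
import numpy as np

def neutralize(e, g):
    """Remove the component of embedding e that lies along the bias direction g."""
    e_bias = (np.dot(e, g) / np.dot(g, g)) * g  # projection of e onto g
    return e - e_bias

rng = np.random.default_rng(5)
g = rng.standard_normal(50)            # stand-in bias direction
e_babysitter = rng.standard_normal(50) # made-up embedding for a non-definitional word
e_clean = neutralize(e_babysitter, g)  # now orthogonal to the bias direction
```
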
Basic Models
In this section we will learn about sequence to sequence - Many to Many - models which are
useful in various applications including machine translation and speech recognition.
Let's start with the basic model:
Given this machine translation problem in which X is a French sequence and Y is an English
sequence.
The architecture uses a pretrained CNN (like AlexNet) as an encoder for the image, and the
decoder is an RNN.
Ideas are from the following papers (they share similar ideas):
Mao et al., 2014. Deep captioning with multimodal recurrent neural networks
Vinyals et al., 2014. Show and tell: Neural image caption generator
Karpathy and Li, 2015. Deep visual-semantic alignments for generating image
descriptions
There are some similarities between the language model we have learned previously, and the
machine translation model we have just discussed, but there are some differences as well.
The language model we have learned is very similar to the decoder part of the machine
translation model, except for a<0>
The most common algorithm is the beam search, which we will explain in the next section.
Why not use greedy search? Why not get the best choices each time?
It turns out that this approach doesn't really work!
Lets explain it with an example:
The best output for the example we talked about is "Jane is visiting Africa in
September."
Suppose that when choosing greedily, the first two words were "Jane is"; the word most
likely to come next is "going", as "going" is the most common word after "is", so the
result may be "Jane is going to be visiting Africa in September.". And that isn't the
best/optimal solution.
So, rather than the greedy approach, it is better to find an approximate solution that tries
to maximize the output probability (the last equation above).
Beam Search
Beam search is the most widely used algorithm to get the best output sequence. It's a heuristic
search algorithm.
To illustrate the algorithm we will stick with the example from the previous section. We need Y =
"Jane is visiting Africa in September."
The algorithm has a parameter B which is the beam width. Lets take B = 3 which means the
algorithm will get 3 outputs at a time.
For the first step you will get the B best candidate first words, e.g. ["in", "jane",
"september"].
Then for each word in the first output, consider the next (second) word and select the top B
combinations, where the best are those that give the highest value of the product of both
probabilities - P(y<1>|x) * P(y<2>|x,y<1>). So we will then have ["in september", "jane is",
"jane visits"]. Notice that we automatically discard "september" as a first word.
Repeat the same process to get the best B words after ["in september", "jane is",
"jane visits"], and so on.
In this algorithm we keep only B copies of the network.
If B = 1 this becomes greedy search.
But there's another problem. The objective we have described prefers short sequences over
long ones, because multiplying more fractions gives a smaller value; fewer fractions mean a
bigger result.
So there's another step: dividing the score by the number of elements in the sequence
(length normalization).
Unlike exact search algorithms like BFS (Breadth First Search) or DFS (Depth First Search),
Beam Search runs faster but is not guaranteed to find the exact solution.
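The length normalization mentioned above can be sketched as follows (alpha is a softening hyperparameter commonly set around 0.7; the log-probabilities below are made up):

```python
import numpy as np

def normalized_score(log_probs, alpha=0.7):
    """Sum of log P(y<t>|...) divided by Ty^alpha.
    alpha=1 is a plain average; alpha=0 means no normalization."""
    return sum(log_probs) / (len(log_probs) ** alpha)

short = [np.log(0.5), np.log(0.4)]                              # 2-word candidate
longer = [np.log(0.5), np.log(0.4), np.log(0.45), np.log(0.5)]  # 4-word candidate

# Without normalization, the longer candidate is always penalized for its
# extra factors; normalizing by length makes the comparison fairer.
raw_short, raw_longer = sum(short), sum(longer)
norm_short, norm_longer = normalized_score(short), normalized_score(longer)
```
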
We have talked before on Error analysis in "Structuring Machine Learning Projects" course. We
will apply these concepts to improve our beam search algorithm.
We will use error analysis to figure out if the B hyperparameter of the beam search is the
problem (it doesn't get an optimal solution) or in our RNN part.
Let's take an example:
Initial info:
x = "Jane visite l’Afrique en septembre."
y* = "Jane visits Africa in September." - right answer
ŷ = "Jane visited Africa last September." - answer produced by model
Our model has produced a poor result.
We now want to know what to blame - the RNN or the beam search.
To do that, we calculate P(y* | X) and P(ŷ | X). There are two cases:
Case 1 (P(y* | X) > P(ŷ | X)):
Conclusion: Beam search is at fault.
Case 2 (P(y* | X) <= P(ŷ | X)):
Conclusion: RNN model is at fault.
The error analysis process is as follows: you choose N error examples and make the
following table:
BLEU Score
One of the challenges of machine translation is that, given a sentence in one language, there
are one or more possible good translations in another language. So how do we evaluate our
results?
The way we do this is by using BLEU score. BLEU stands for bilingual evaluation understudy.
The intuition is: as long as the machine-generated translation is pretty close to any of the
references provided by humans, then it will get a high BLEU score.
Let's take an example:
X = "Le chat est sur le tapis."
Y1 = "The cat is on the mat." (human reference 1)
Y2 = "There is a cat on the mat." (human reference 2)
Suppose that the machine outputs: "the the the the the the the."
One way to evaluate the machine output is to look at each word in the output and check if it
is in the references. This is called precision:
precision = 7/7 because "the" appeared in Y1 or Y2
This is not a useful measure!
We can use a modified precision, in which we look for the reference with the maximum
count of a particular word and cap the count of this word at that number. So:
modified precision = 2/7, because the maximum count of "the" in a single reference is 2
(in Y1).
We clipped the count of 7 down to the maximum, which is 2.
Here we are looking at one word at a time - unigrams, we may look at n-grams too
BLEU score on bigrams
The n-grams typically are collected from a text or speech corpus. When the items are words,
n-grams may also be called shingles. An n-gram of size 1 is referred to as a "unigram"; size
2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram".
Suppose that the machine outputs: "the cat the cat on the mat."
Bigram     Count   Count_clip
the cat    2       1 (Y1)
cat the    1       0
cat on     1       1 (Y2)
on the     1       1 (Y1)
the mat    1       1 (Y1)
Totals     6       4
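The clipped-count computation behind these numbers can be sketched with collections.Counter:

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by
    its maximum count in any single reference."""
    def ngrams(words):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand = ngrams(candidate)
    clipped = sum(min(count, max(ngrams(ref)[g] for ref in references))
                  for g, count in cand.items())
    return clipped, sum(cand.values())

y1 = "the cat is on the mat".split()
y2 = "there is a cat on the mat".split()
output = "the cat the cat on the mat".split()

num, den = modified_precision(output, [y1, y2], 2)  # bigram case: 4 clipped out of 6
```
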
Attention Model Intuition
So far we have been using sequence-to-sequence models with an encoder and a decoder. There is a
technique called attention which makes these models even better.
The attention idea has been one of the most influential ideas in deep learning.
The problem of long sequences:
Given this model, inputs, and outputs.
The encoder should memorize this long sequence into one vector, and the decoder has to
process this vector to generate the translation.
A human translator wouldn't read the whole sentence, memorize it, and then
translate it; they translate a part at a time.
The performance of this model decreases if a sentence is long.
We will discuss the attention model that works like a human that looks at parts at a time.
That will significantly increase the accuracy even with longer sequence:
Blue is the normal model, while green is the model with attention mechanism.
In this section we will give just some intuition about the attention model, and in the next section
we will discuss its details.
At first the attention model was developed for machine translation but then other applications
used it like computer vision and new architectures like Neural Turing machine.
The attention model was described in this paper:
Bahdanau et. al., 2014. Neural machine translation by jointly learning to align and translate
Now for the intuition:
Suppose that our encoder is a bidirectional RNN:
We give the French sentence to the encoder and it should generate a vector that represents
the inputs.
Now to generate the first word in English which is "Jane" we will make another RNN which is
the decoder.
Attention weights are used to specify which input words are needed when generating an
output word. So to generate "Jane" we will look mostly at "Jane", "visite", "l'Afrique".
alpha<1,1>, alpha<1,2>, and alpha<1,3> are the attention weights being used.
And so to generate any word there will be a set of attention weights that controls which
words we are looking at right now.
Attention Model
Let's formalize the intuition from the last section into the exact details of how this can be
implemented.
First we will have a bidirectional RNN (most commonly with LSTM units) that encodes the French sentence:
For learning purposes, let's assume that a<t'> includes the activations from both directions at time
step t'.
We will have a unidirectional RNN that produces the output using a context c<t>, which is
computed from the attention weights alpha<t, t'>; these denote how much the output at step t
needs to look at the activation a<t'>:
c<t> = sum over t' of alpha<t, t'> * a<t'>
The attention weights for each output step should sum to 1, which is enforced by a softmax
over scores e<t, t'>:
alpha<t, t'> = exp(e<t, t'>) / sum over t'' of exp(e<t, t''>)
Now we need to know how to calculate e<t, t'>. We will compute e using a small neural
network (usually 1-layer, because we will need to compute this a lot):
s<t-1> is the hidden state of the RNN s, and a<t'> is the activation of the other
bidirectional RNN.
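A minimal NumPy sketch of one attention step, with the small scoring network reduced to a single tanh layer and random weights for illustration. The shapes and the names `We`, `be`, `ve` are assumptions, not the course's notation:

```python
import numpy as np

def attention_step(a, s_prev, We, be, ve):
    """One attention step: score every encoder activation a<t'> against
    the previous decoder state s<t-1>, softmax the scores, and return
    the context vector c<t> = sum_t' alpha<t,t'> * a<t'>."""
    Tx = a.shape[0]
    # e<t,t'>: small 1-layer network on the pair [s<t-1>; a<t'>]
    concat = np.concatenate([np.tile(s_prev, (Tx, 1)), a], axis=1)  # (Tx, ns+na)
    e = np.tanh(concat @ We + be) @ ve                              # (Tx,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()          # attention weights sum to 1
    context = alpha @ a           # (na,)
    return context, alpha

rng = np.random.default_rng(0)
Tx, na, ns, hidden = 5, 8, 6, 10
a = rng.standard_normal((Tx, na))        # bidirectional encoder activations
s_prev = rng.standard_normal(ns)         # previous decoder hidden state
We = rng.standard_normal((ns + na, hidden))
be = np.zeros(hidden)
ve = rng.standard_normal(hidden)

context, alpha = attention_step(a, s_prev, We, be, ve)
print(alpha.sum())  # 1.0
```

In a real model `We`, `be`, `ve` are learned jointly with the rest of the network.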
One disadvantage of this algorithm is that it takes quadratic time (and cost) to run: with Tx input words and Ty output words, we compute Tx * Ty attention weights.
One fun way to see how attention works is by visualizing the attention weights:
Speech recognition
One of the most exciting developments using sequence-to-sequence models has been the rise
of very accurate speech recognition.
Let's define the speech recognition problem:
X: audio clip
Y: transcript
If you plot an audio clip it will look like this:
The horizontal axis is time while the vertical is changes in air pressure.
What really is an audio recording? A microphone records little variations in air pressure over
time, and it is these little variations in air pressure that your ear perceives as sound. You can
think of an audio recording as a long list of numbers measuring the little air pressure
changes detected by the microphone. We will use audio sampled at 44100 Hz (44100
Hertz), meaning the microphone gives us 44100 numbers per second. Thus, a 10 second
audio clip is represented by 441000 numbers (= 10 * 44100).
It is quite difficult to work with the "raw" representation of audio, because even the human
ear doesn't process raw waveforms: it extracts different frequencies first.
There's a common preprocessing step for audio: generating a spectrogram, which works
similarly to the human ear.
The horizontal axis is time while the vertical is frequencies. Intensity of different colors
shows the amount of energy - how loud is the sound for different frequencies (a human
ear does a very similar preprocessing step).
A spectrogram is computed by sliding a window over the raw audio signal and calculating
the most active frequencies in each window using a Fourier transform.
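A rough sketch of this windowed-FFT computation. The window size, step, sample rate, and Hann window below are illustrative choices (real pipelines often add log mel filterbanks on top):

```python
import numpy as np

def spectrogram(signal, window=256, step=128):
    """Magnitude spectrogram: slide a window over the raw audio and
    take an FFT of each (Hann-windowed) frame."""
    frames = [signal[i:i + window] * np.hanning(window)
              for i in range(0, len(signal) - window + 1, step)]
    # keep only the non-negative frequencies of each frame
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

fs = 8000                                 # illustrative sample rate
t = np.arange(fs) / fs                    # 1 second of audio
audio = np.sin(2 * np.pi * 440 * t)       # a pure 440 Hz tone
S = spectrogram(audio)
print(S.shape)                            # (num_frames, window // 2 + 1)
peak_bin = S[0].argmax()
print(peak_bin * fs / 256)                # peak frequency near 440 Hz
```

Each row of `S` is one time window; each column is one frequency, matching the time/frequency axes described above.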
In the past, speech recognition systems were built using phonemes, which are hand-engineered
basic units of sound. Linguists used to hypothesize that writing down audio in
terms of these basic units of sound called phonemes would be the best way to do speech
recognition.
End-to-end deep learning showed that phonemes are no longer needed. One of the things
that made this possible is large audio datasets.
Research papers have around 300 - 3000 hours of training data while the best commercial
systems are now trained on over 100,000 hours of audio.
You can build an accurate speech recognition system using the attention model that we have
described in the previous section:
One of the methods that seems to work well is the CTC cost, which stands for "Connectionist
Temporal Classification".
To explain this let's say that Y = "the quick brown fox"
We are going to use an RNN with input, output structure:
The _ is a special character called "blank" and <SPC> is for the "space" character.
Basic rule for CTC: collapse repeated characters not separated by "blank"
So the 19 characters of our Y can be recovered from a 1000-character RNN output using CTC and its
special blanks.
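The collapsing rule can be sketched as a short function (using `_` as the blank symbol, as in the example above):

```python
def ctc_collapse(path, blank="_"):
    """Apply the CTC collapsing rule: merge repeated characters not
    separated by a blank, then drop the blanks."""
    out = []
    prev = None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("ttt_h_eee_ _qqq"))  # "the q"
print(ctc_collapse("c_oo_o_l"))         # "cool" - blank keeps the repeated "o"
```

Note how a blank between two identical characters preserves both of them, which is how CTC can output words like "cool" with genuine double letters.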
The ideas were taken from this paper:
Graves et al., 2006. Connectionist Temporal Classification: Labeling unsegmented
sequence data with recurrent neural networks
This paper's ideas were also used by Baidu's DeepSpeech.
Using both attention model and CTC cost can help you to build an accurate speech recognition
system.
Trigger Word Detection
With the rise of deep learning for speech recognition, there are a lot of devices that can be woken
up by saying certain words. These systems are called trigger word detection systems.
For example, Alexa - a smart device made by Amazon - wakes up when you say "Alexa" (e.g. "Alexa,
what time is it?") and then responds to you.
Trigger word detection systems include devices such as Amazon Echo ("Alexa"), Apple Siri ("Hey Siri"), and Google Home ("Okay Google").
For now, the trigger word detection literature is still evolving so there actually isn't a single
universally agreed on the algorithm for trigger word detection yet. But let's discuss an algorithm
that can be used.
Let's now build a model that can solve this problem:
X: audio clip
X has been preprocessed and spectrogram features have been extracted from it:
X<1>, X<2>, ... , X<t>
Y will be labels of 0 or 1: 0 means the trigger word has not just been said, while 1 means it has.
The model architecture can be like this:
The vertical lines in the audio clip mark the moments just after the trigger word was said;
the labels corresponding to those moments will be 1.
One disadvantage is that this creates a very imbalanced training set: there will be a lot of zeros
and few ones.
A hack to solve this is to output ones for several time steps (a fixed period of time) after the
trigger word before reverting back to zero.
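This labeling hack can be sketched as follows. `Ty = 1375` and the 50-step duration are illustrative values in the style of the course assignment, not required constants:

```python
import numpy as np

def make_labels(Ty, trigger_end_steps, ones_duration=50):
    """Build the 0/1 target sequence: after each time step where the
    trigger word ends, emit 1 for a fixed number of steps instead of
    a single 1, to reduce the label imbalance."""
    y = np.zeros(Ty, dtype=int)
    for end in trigger_end_steps:
        y[end + 1:end + 1 + ones_duration] = 1
    return y

y = make_labels(Ty=1375, trigger_end_steps=[200, 900])
print(y.sum())                 # 100 -> two stretches of 50 ones
print(y[201], y[250], y[251])  # 1 1 0
```

Even with this hack the labels stay mostly zero, but 50 positive steps per trigger is far easier to learn from than a single one.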
Extras
There are two separate LSTMs in this model. Because the one at the bottom of the picture is
a Bi-directional LSTM and comes before the attention mechanism, we will call it pre-attention
Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we
will call it the post-attention LSTM. The pre-attention Bi-LSTM goes through Tx time steps;
the post-attention LSTM goes through Ty time steps.
The post-attention LSTM passes s<t>, c<t> from one time step to the next. In the lecture
videos, we were using only a basic RNN for the post-attention sequence model, so the state
captured by the RNN was just the output activation s<t>. But since we are using an LSTM here, the
LSTM has both the output activation s<t> and the hidden cell state c<t>. However, unlike
previous text generation examples (such as Dinosaurus in week 1), in this model the post-attention
LSTM at time t will not take the generated y<t-1> as input; it only
takes s<t> and c<t> as input. The model was designed this way because (unlike
language generation, where adjacent characters are highly correlated) there isn't as strong a
dependency between the previous character and the next character in a YYYY-MM-DD date.
What one "Attention" step does to calculate the attention variables α <t, t>
, which are used to
compute the context variable context <t> for each timestep in the output (t=1, ..., Ty).
The diagram uses a RepeatVector node to copy s <t-1> 's value Tx times, and then
Concatenation to concatenate s <t-1> and a <t> to compute e <t, t> , which is then passed
through a softmax to compute α <t, t>
.
These Notes were made by Mahmoud Badry @2018