Neural Networks and Deep Learning
This is the first course of the deep learning specialization at Coursera which is moderated by DeepLearning.ai. The course is taught
by Andrew Ng.
Table of contents
Neural Networks and Deep Learning
o Table of contents
o Course summary
o Introduction to deep learning
What is a (Neural Network) NN?
Supervised learning with neural networks
Why is deep learning taking off?
o Neural Networks Basics
Binary classification
Logistic regression
Logistic regression cost function
Gradient Descent
Derivatives
More Derivatives examples
Computation graph
Derivatives with a Computation Graph
Logistic Regression Gradient Descent
Gradient Descent on m Examples
Vectorization
Vectorizing Logistic Regression
Notes on Python and NumPy
General Notes
o Shallow neural networks
Neural Networks Overview
Neural Network Representation
Computing a Neural Network's Output
Vectorizing across multiple examples
Activation functions
Why do you need non-linear activation functions?
Derivatives of activation functions
Gradient descent for Neural Networks
Random Initialization
o Deep Neural Networks
Deep L-layer neural network
Forward Propagation in a Deep Network
Getting your matrix dimensions right
Why deep representations?
Building blocks of deep neural networks
Forward and Backward Propagation
Parameters vs Hyperparameters
What does this have to do with the brain?
o Extra: Ian Goodfellow interview
Course summary
Here is the course summary as it's given on the course link:
If you want to break into cutting-edge AI, this course will help you do so. Deep learning engineers are highly sought after, and
mastering deep learning will give you numerous new career opportunities. Deep learning is also a new "superpower" that will let you
build AI systems that just weren't possible a few years ago.
In this course, you will learn the foundations of deep learning.
This course also teaches you how deep learning actually works, rather than presenting only a cursory or surface-level description. So
after completing it, you will be able to apply deep learning to your own applications. If you are looking for a job in AI, after this
course you will also be able to answer basic interview questions.
Basically, a single neuron computes a weighted sum of its inputs (w.T*x); in a perceptron we then set a threshold to predict the
output: if the weighted sum crosses the threshold the perceptron fires (outputs 1), and if not it doesn't fire (outputs 0).
The disadvantage of the perceptron is that it only outputs binary values, and a small change in a weight or the bias can flip the
output completely. We need a unit whose output changes only slightly when the weights and bias change slightly. This is where the
sigmoid function comes into the picture.
If we replace the perceptron's threshold with a sigmoid function, a small change in the parameters gives only a small change in the output.
e.g. suppose the perceptron's output is 0; a slight change in the weights and bias can flip it straight to 1, even though the
appropriate answer is something like 0.7. With a sigmoid, the same slight change moves the output smoothly, e.g. to 0.7.
If we apply the sigmoid activation function, a single neuron acts as logistic regression.
We can see the difference between the perceptron and the sigmoid function by looking at the sigmoid function's graph.
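To make the perceptron vs. sigmoid point concrete, here is a small sketch (the weights, input, and zero threshold are made up for illustration):
```python
import numpy as np

def perceptron(w, x, b):
    # fires (outputs 1) only if the weighted sum crosses the threshold 0
    return 1 if np.dot(w, x) + b > 0 else 0

def sigmoid_neuron(w, x, b):
    # outputs a smooth value between 0 and 1
    return 1 / (1 + np.exp(-(np.dot(w, x) + b)))

x = np.array([1.0, 2.0])
w = np.array([0.4, -0.2])           # made-up weights
print(perceptron(w, x, b=0.0))      # 0 or 1 only; flips abruptly as w and b change
print(sigmoid_neuron(w, x, b=0.0))  # 0.5 here; changes slightly as w and b change slightly
```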
Simple NN graph:
o Image taken from tutorialspoint.com
RELU stands for rectified linear unit; it is the most popular activation function right now and it makes deep NNs train faster.
Hidden layers figure out the connections between the inputs automatically; that's what deep learning is good at.
Each input is connected to the hidden layer, and the NN decides the connections.
Supervised learning means we have the (X,Y) and we need to get the function that maps X to Y.
i. Data:
From the (amount of data vs. performance) figure in the lecture we can conclude:
For small data a NN can perform like linear regression or an SVM (support vector machine).
For big data a small NN is better than an SVM.
For big data a big NN is better than a medium NN, which is better than a small NN.
Hopefully we have a lot of data, because the world is using computers more and more:
Mobiles
IOT (Internet of things)
ii. Computation:
GPUs.
Powerful CPUs.
Distributed computing.
ASICs
iii. Algorithm:
Binary classification
Mainly he is talking about how to use logistic regression to make a binary classifier.
o Image taken from 3.bp.blogspot.com
He talked about an example of knowing if the current image contains a cat or not.
Here are some notations:
o M is the number of training vectors
o Nx is the size of the input vector
o Ny is the size of the output vector
o X(1) is the first input vector
o Y(1) is the first output vector
o X = [x(1) x(2) .. x(M)] (the examples stacked as columns, so X has shape (Nx, M))
o Y = [y(1) y(2) .. y(M)] (shape (1, M))
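A small sketch of these notations in NumPy (Nx = 4 and M = 3 are made-up sizes): each training example is a column of X, and the labels are stacked into a row vector Y.
```python
import numpy as np

Nx, M = 4, 3                        # illustrative sizes
x1 = np.random.rand(Nx, 1)          # first input vector, shape (Nx, 1)
x2 = np.random.rand(Nx, 1)
x3 = np.random.rand(Nx, 1)

X = np.hstack([x1, x2, x3])         # X = [x(1) x(2) x(3)], shape (Nx, M)
Y = np.array([[1, 0, 1]])           # labels stacked as a row, shape (1, M)
print(X.shape, Y.shape)             # (4, 3) (1, 3)
```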
Logistic regression
Logistic regression is an algorithm used for classification with 2 classes.
Equations:
o Simple equation: y = wx + b
o If x is a vector: y = w(transpose)x + b
o If we need y to be in between 0 and 1 (probability): y = sigmoid(w(transpose)x + b)
o In some notations this might be used: y = sigmoid(w(transpose)x)
In that notation b is treated as w0 with an extra input x0 = 1, but we won't use this notation in the course (Andrew said that
the first notation is better).
In binary classification the output y (a probability) has to be between 0 and 1, while the label Y is either 0 or 1.
In the last equation w is a vector of Nx and b is a real number
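A minimal sketch of that last equation in NumPy (the size Nx = 4 and the zero initialization are just for illustration):
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Nx = 4
x = np.random.rand(Nx, 1)            # one input vector
w = np.zeros((Nx, 1))                # weights, a column vector of size Nx
b = 0.0                              # bias, a real number

y_hat = sigmoid(np.dot(w.T, x) + b)  # prediction in (0, 1), shape (1, 1)
print(y_hat)                         # 0.5 with zero-initialized w and b
```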
Gradient Descent
We want to find w and b that minimize the cost function.
First we initialize w and b (to zeros, or to random values; the cost function is convex, so any starting point works), and then
we iteratively improve the values to reach the minimum.
The gradient descent algorithm repeats: w = w - alpha * dw (and likewise b = b - alpha * db), where alpha is the learning rate
and dw is the derivative of the cost with respect to w (how much to change w). The derivative is also the slope.
It looks like a greedy algorithm: the derivative gives us the direction in which to improve our parameters.
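As a tiny self-contained illustration of the update rule (the cost J(w) = (w - 3)^2 here is made up just to show the descent; the logistic regression gradients come in the next sections):
```python
alpha = 0.1                      # learning rate
w = 0.0                          # initial value
for _ in range(100):
    dw = 2 * (w - 3)             # derivative (slope) of J(w) = (w - 3)**2
    w = w - alpha * dw           # step against the slope
print(w)                         # converges towards 3, the minimum of J
```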
Derivatives
We will talk about some of the required calculus.
You don't need to be a calculus geek to master deep learning, but you'll need some skills from it.
The derivative of a straight line is its slope.
o ex. f(a) = 3a, so d(f(a))/d(a) = 3
o if a = 2 then f(a) = 6
o if we move a a little bit, a = 2.001, then f(a) = 6.003, which means the change in f equals the derivative (slope) times the
change in a, added to the previous result.
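A quick numerical check of that example, nudging a by 0.001:
```python
def f(a):
    return 3 * a

a = 2.0
h = 0.001
slope = (f(a + h) - f(a)) / h    # approximately 3, the derivative of f
print(f(a), f(a + h), slope)     # 6.0 6.003 3.000...
```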
Computation graph
It's a graph that organizes the computation from left to right.
X1: the first feature
X2: the second feature
W1: the weight of the first feature
W2: the weight of the second feature
B: the logistic regression bias parameter
M: the number of training examples
Y(i): the expected output of example i
So we have: z = W1*X1 + W2*X2 + B, then a = Sigmoid(z), then the loss L(a, Y).
Then, from right to left, we compute the derivatives of the loss with respect to each node in the graph (da, dz, dw1, dw2, db).
From the above we can conclude the logistic regression pseudo code:
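A minimal runnable sketch of that pseudo code, assuming two features (w1, w2) as in the computation graph above; the data, learning rate, and label rule are made up for illustration:
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# made-up data: m examples with two features x1, x2 and labels y
m = 100
x1, x2 = np.random.rand(m), np.random.rand(m)
y = (x1 + x2 > 1).astype(float)

w1, w2, b = 0.0, 0.0, 0.0                # parameters
alpha = 0.1                              # learning rate
J = dw1 = dw2 = db = 0.0                 # cost and gradient accumulators

for i in range(m):                       # one pass over the m examples
    z = w1 * x1[i] + w2 * x2[i] + b      # forward pass
    a = sigmoid(z)
    J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
    dz = a - y[i]                        # backward pass (right to left)
    dw1 += dz * x1[i]
    dw2 += dz * x2[i]
    db += dz

J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
w1 -= alpha * dw1                        # gradient descent update
w2 -= alpha * dw2
b -= alpha * db
```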
The above code should run for some iterations to minimize error.
Vectorization is very important in deep learning to reduce loops. In the code above we can do the whole loop over the examples in
one step using vectorization!
Vectorization
Deep learning shines when the datasets are big. However, for loops will make you wait a long time for a result. That's why we need
vectorization to get rid of some of our for loops.
The NumPy dot function (np.dot) is vectorized by default.
Vectorization can be done on CPU or GPU through SIMD operations, but it's faster on a GPU.
Whenever possible, avoid for loops.
Most of the NumPy library's methods are vectorized versions.
Reshape is computationally cheap, so use it whenever you're not sure about a shape.
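A small comparison sketch (the vector size of one million is arbitrary, and timings vary by machine) showing why np.dot beats an explicit for loop:
```python
import time
import numpy as np

n = 1_000_000
a, b = np.random.rand(n), np.random.rand(n)

tic = time.time()
c = np.dot(a, b)                 # vectorized (uses SIMD under the hood)
print("vectorized:", time.time() - tic, "seconds")

tic = time.time()
c = 0.0
for i in range(n):               # explicit for loop, much slower
    c += a[i] * b[i]
print("for loop:  ", time.time() - tic, "seconds")
```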
Broadcasting works when you do a matrix operation with matrices whose shapes don't match for the operation; in this case
NumPy automatically makes the shapes compatible by broadcasting the values.
The general principle of broadcasting: if you have an (m, n) matrix and you add (+), subtract (-), multiply (*), or divide (/)
it by a (1, n) matrix, the latter is copied m times into an (m, n) matrix. The same happens with an (m, 1) matrix, which is
copied n times into an (m, n) matrix. The addition, subtraction, multiplication, or division is then applied element-wise.
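A short broadcasting sketch with a made-up (2, 3) matrix:
```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (2, 3)
row = np.array([[10.0, 20.0, 30.0]])   # shape (1, 3), copied down the 2 rows
col = np.array([[100.0], [200.0]])     # shape (2, 1), copied across the 3 columns

print(A + row)   # row is broadcast to (2, 3) before the element-wise addition
print(A / col)   # col is broadcast to (2, 3) before the element-wise division
```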
o If you don't specify the full shape of a vector, it will take a shape of (m,) (a rank-1 array) and the transpose operation will
have no effect. You have to reshape it to (m, 1).
o Try not to use rank-1 arrays in your NN.
o Don't hesitate to use assert(a.shape == (5, 1)) to check that your matrix has the required shape.
o If you find a rank-1 array, run reshape on it.
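A short sketch of the rank-1 pitfall and the fix (the size 5 is just for illustration):
```python
import numpy as np

a = np.random.randn(5)           # rank-1 array, shape (5,)
print(a.shape, a.T.shape)        # (5,) (5,) -- the transpose does nothing

a = a.reshape(5, 1)              # make it a proper column vector
assert a.shape == (5, 1)         # cheap sanity check on the shape
print(np.dot(a.T, a).shape)      # (1, 1), as expected for a column vector
```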
Jupyter / IPython notebooks are a very useful Python tool that makes it easy to combine code and documentation in the same
place. They run in the browser and don't need an IDE.
o To open Jupyter Notebook, open the command line and run: jupyter-notebook (it must be installed first).
To compute the derivative of the sigmoid:
s = sigmoid(x) # where sigmoid(x) = 1 / (1 + np.exp(-x))
ds = s * (1 - s) # derivative of the sigmoid, using calculus
To make an image of (width,height,depth) be a vector, use this:
v = image.reshape(image.shape[0]*image.shape[1]*image.shape[2],1) #reshapes the image.
General Notes
The main steps for building a Neural Network are (a vectorized sketch tying them together follows after this list):
o Define the model structure (such as number of input features and outputs)
o Initialize the model's parameters.
o Loop.
Calculate current loss (forward propagation)
Calculate current gradient (backward propagation)
Update parameters (gradient descent)
Preprocessing the dataset is important.
Tuning the learning rate (which is an example of a "hyperparameter") can make a big difference to the algorithm.
kaggle.com is a good place for datasets and competitions.
Pieter Abbeel is one of the best in deep reinforcement learning.
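As mentioned above, here is a vectorized sketch that ties the main steps together for logistic regression (the function name, the made-up data, and the hyperparameter values are illustrative, not course code):
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def model(X, Y, num_iterations=1000, learning_rate=0.1):
    # 1. define the model structure: n_x input features, one output unit
    n_x, m = X.shape
    # 2. initialize the model's parameters
    w, b = np.zeros((n_x, 1)), 0.0
    # 3. loop: forward propagation, backward propagation, parameter update
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)       # forward propagation, shape (1, m)
        cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # could be printed to monitor training
        dZ = A - Y                            # backward propagation
        dw = np.dot(X, dZ.T) / m
        db = np.sum(dZ) / m
        w -= learning_rate * dw               # gradient descent update
        b -= learning_rate * db
    return w, b

# illustrative usage with made-up data
X = np.random.rand(4, 100)                            # 4 features, 100 examples
Y = (X.sum(axis=0, keepdims=True) > 2).astype(float)  # made-up labels, shape (1, 100)
w, b = model(X, Y)
```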
In logistic regression we had:
X1 \
X2 ==> z = XW + B ==> a = Sigmoid(z) ==> l(a,Y)
X3 /
In a neural network with one hidden layer we have:
X1 \
X2 => z1 = XW1 + B1 => a1 = Sigmoid(z1) => z2 = a1W2 + B2 => a2 = Sigmoid(z2) => l(a2,Y)
X3 /
X is the input vector (X1, X2, X3), and Y is the output variable (1x1).
Pseudo code for computing the 2-layer NN's output over the m examples with a for loop:
for i = 1 to m
    z[1, i] = W1*x[i] + b1 # shape of z[1, i] is (noOfHiddenNeurons,1)
    a[1, i] = sigmoid(z[1, i]) # shape of a[1, i] is (noOfHiddenNeurons,1)
    z[2, i] = W2*a[1, i] + b2 # shape of z[2, i] is (1,1)
    a[2, i] = sigmoid(z[2, i]) # shape of a[2, i] is (1,1)
Let's say we have X of shape (Nx, m). Then the vectorized pseudo code is:
Z1 = W1X + b1 # shape of Z1 (noOfHiddenNeurons,m)
A1 = sigmoid(Z1) # shape of A1 (noOfHiddenNeurons,m)
Z2 = W2A1 + b2 # shape of Z2 is (1,m)
A2 = sigmoid(Z2) # shape of A2 is (1,m)
In the last example we can call X = A0. So the previous step can be rewritten as:
Z1 = W1A0 + b1 # shape of Z1 (noOfHiddenNeurons,m)
A1 = sigmoid(Z1) # shape of A1 (noOfHiddenNeurons,m)
Z2 = W2A1 + b2 # shape of Z2 is (1,m)
A2 = sigmoid(Z2) # shape of A2 is (1,m)
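A runnable version of that vectorized pseudo code, with made-up sizes (3 input features, 4 hidden neurons, m = 5 examples) and small random weights:
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h, m = 3, 4, 5
A0 = np.random.rand(n_x, m)              # A0 is X, shape (n_x, m)
W1, b1 = np.random.randn(n_h, n_x) * 0.01, np.zeros((n_h, 1))
W2, b2 = np.random.randn(1, n_h) * 0.01, np.zeros((1, 1))

Z1 = np.dot(W1, A0) + b1                 # shape (n_h, m), b1 is broadcast
A1 = sigmoid(Z1)                         # shape (n_h, m)
Z2 = np.dot(W2, A1) + b2                 # shape (1, m)
A2 = sigmoid(Z2)                         # shape (1, m)
print(A2.shape)                          # (1, 5)
```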
Activation functions
So far we have been using sigmoid, but in some cases other functions can be a lot better.
Sigmoid can lead to the vanishing gradient problem, where the parameter updates become very small.
Sigmoid activation function, range (0, 1): A = 1 / (1 + np.exp(-z)) # where z is the input matrix
Tanh activation function, range (-1, 1) (a shifted and scaled version of the sigmoid function):
o In NumPy we can implement tanh using either: A = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z)) # where z is the input matrix
o Or: A = np.tanh(z) # where z is the input matrix
It turns out that the tanh activation usually works better than sigmoid activation function for hidden units because the
mean of its output is closer to zero, and so it centers the data better for the next layer.
A disadvantage of sigmoid and tanh is that if the input is very small or very large, the slope will be near zero, which causes
the vanishing gradient problem.
One of the popular activation functions that solves this slow-learning problem is the RELU function. RELU = max(0,z) # so
if z is negative the slope is 0 and if z is positive the slope is 1.
So here is a basic rule for choosing activation functions: if your output is a 0/1 classification, use sigmoid for the output
activation and RELU for the others.
The leaky RELU differs from RELU in that if the input is negative the slope is small but not zero. It works like RELU, but most
people use plain RELU. Leaky_RELU = max(0.01z,z) # the 0.01 can be a parameter of your algorithm.
In a NN you will have to make a lot of choices, like:
o No of hidden layers.
o No of neurons in each hidden layer.
o Learning rate. (The most important parameter)
o Activation functions.
o And others..
It turns out there are no universal guidelines for these choices. For example, you should try different activation functions and
see which works best.
Derivative of the sigmoid activation function:
g(z) = 1 / (1 + np.exp(-z))
g'(z) = (1 / (1 + np.exp(-z))) * (1 - (1 / (1 + np.exp(-z))))
g'(z) = g(z) * (1 - g(z))
Derivative of the RELU activation function:
g(z) = np.maximum(0,z)
g'(z) = { 0 if z < 0
          1 if z >= 0 }
Derivative of the leaky RELU activation function:
g(z) = np.maximum(0.01 * z, z)
g'(z) = { 0.01 if z < 0
          1    if z >= 0 }
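The same functions and derivatives as NumPy code (a small sketch; the tanh derivative, g'(z) = 1 - tanh(z)^2, is added for completeness):
```python
import numpy as np

def sigmoid(z):      return 1 / (1 + np.exp(-z))          # g(z)
def d_sigmoid(z):    return sigmoid(z) * (1 - sigmoid(z))  # g'(z) = g(z) * (1 - g(z))
def d_tanh(z):       return 1 - np.tanh(z) ** 2            # derivative of np.tanh(z)
def relu(z):         return np.maximum(0, z)
def d_relu(z):       return np.where(z < 0, 0.0, 1.0)      # 0 if z < 0, else 1
def leaky_relu(z):   return np.maximum(0.01 * z, z)
def d_leaky_relu(z): return np.where(z < 0, 0.01, 1.0)     # 0.01 if z < 0, else 1
```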
o NN parameters:
n[0] = Nx
n[1] = NoOfHiddenNeurons
n[2] = NoOfOutputNeurons = 1
W1 shape is (n[1],n[0])
b1 shape is (n[1],1)
W2 shape is (n[2],n[1])
b2 shape is (n[2],1)
o Cost function: J = J(W1, b1, W2, b2) = (1/m) * Sum(L(Y,A2))
o Repeat:
o Compute predictions (y'[i], i = 0,...m)
o Get derivatives: dW1, db1, dW2, db2
o Update: W1 = W1 - LearningRate * dW1
o b1 = b1 - LearningRate * db1
o W2 = W2 - LearningRate * dW2
o b2 = b2 - LearningRate * db2
Forward propagation:
Z1 = W1A0 + b1 # A0 is X
A1 = g1(Z1)
Z2 = W2A1 + b2
A2 = Sigmoid(Z2) # Sigmoid because the output is between 0 and 1
Backpropagation (derivations):
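A sketch of those derivations in NumPy notation, continuing from the forward pass above (g1_prime stands for the derivative of the hidden layer's activation g1, e.g. d_tanh from the activation-function block earlier):
```python
# gradients for the 2-layer network above (sigmoid output, cross-entropy loss)
dZ2 = A2 - Y                                        # shape (1, m)
dW2 = (1 / m) * np.dot(dZ2, A1.T)                   # shape (n[2], n[1])
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)  # shape (n[2], 1)
dZ1 = np.dot(W2.T, dZ2) * g1_prime(Z1)              # element-wise product, shape (n[1], m)
dW1 = (1 / m) * np.dot(dZ1, A0.T)                   # A0 is X, shape (n[1], n[0])
db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)  # shape (n[1], 1)
```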
Random Initialization
In logistic regression it wasn't important to initialize the weights randomly, but in a NN we have to initialize them randomly.
If we initialize all the weights with zeros in NN it won't work (initializing bias with zero is OK):
o all hidden units will be completely identical (symmetric) - compute exactly the same function
o on each gradient descent iteration all the hidden units will always update the same
We need small values because with a sigmoid (or tanh) activation, if the weights are too large you are likely to end up, even at
the very start of training, with very large values of Z. That saturates the tanh or sigmoid activation function and slows down
learning. If you don't have any sigmoid or tanh activation functions in your neural network, this is less of an issue.
The constant 0.01 is alright for a 1-hidden-layer network; if the NN is deep this number can be changed, but it should always be
a small number.
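A small sketch of this initialization for a 1-hidden-layer network (layer sizes are illustrative):
```python
import numpy as np

n_x, n_h, n_y = 3, 4, 1                 # illustrative layer sizes
W1 = np.random.randn(n_h, n_x) * 0.01   # small random values break the symmetry
b1 = np.zeros((n_h, 1))                 # zeros are fine for the biases
W2 = np.random.randn(n_y, n_h) * 0.01
b2 = np.zeros((n_y, 1))
```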
We can't compute forward propagation over all the layers without a for loop over the layers, so it's OK to have a for loop here.
The dimensions of the matrices are very important, and you need to work them out carefully.
a. Face recognition: Image ==> Edges ==> Face parts ==> Faces ==> desired face
b. Speech recognition: Audio ==> Low-level sound features (like sss, bb) ==> Phonemes ==> Words ==> Sentences
Neural network researchers think that deep neural networks "think" like brains (simple ==> complex).
Circuit theory and deep learning: informally, there are functions you can compute with a relatively small L-layer deep neural
network that shallower networks require exponentially more hidden units to compute.
When starting on an application, don't start directly with dozens of hidden layers. Try the simplest solution first (e.g. logistic
regression), then try a shallow neural network, and so on.
Forward propagation block for layer l:
Input A[l-1]
Z[l] = W[l]A[l-1] + b[l]
A[l] = g[l](Z[l])
Output A[l], cache(Z[l])
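The matching backward block for layer l, as a sketch in the same notation (it uses the cached Z[l] plus A[l-1] and W[l], and g'[l] is the derivative of layer l's activation):
Input dA[l], cache (Z[l], A[l-1], W[l])
dZ[l] = dA[l] * g'[l](Z[l]) # element-wise product
dW[l] = (1/m) dZ[l] A[l-1].T
db[l] = (1/m) sum(dZ[l]) # sum over the m examples
dA[l-1] = W[l].T dZ[l]
Output dA[l-1], dW[l], db[l]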
Parameters vs Hyperparameters
The main parameters of the NN are W and b.
Hyperparameters (parameters that control the algorithm) include:
o Learning rate.
o Number of iterations.
o Number of hidden layers L.
o Number of hidden units n.
o Choice of activation functions.
You have to try different values of the hyperparameters yourself.
In the earlier days of DL and ML the learning rate was often called a parameter, but it really is (and now everybody calls it) a
hyperparameter.
In the next course we will see how to tune hyperparameters.