Intro to Deep Learning ~ MIT 6.S191
Deep learning ~ learning a hierarchy of features, i.e. learning higher-level
features from underlying lower-level features.
Perceptron :
Activation Function :
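For reference, the standard perceptron forward pass, where g is the activation function (e.g. sigmoid) and w_0 the bias:

\hat{y} = g\left( w_0 + \sum_{i=1}^{m} x_i w_i \right)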
Loss Optimisation : finding the network weights that achieve the lowest loss, typically
through gradient descent.
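In symbols, the standard update, assuming loss J(W) and learning rate η:

W \leftarrow W - \eta \, \frac{\partial J(W)}{\partial W}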
Backpropagation : chain rule ~
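e.g. for a weight w_1 feeding a hidden unit z_1, the standard decomposition is:

\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}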
Using a single point for gradient descent is very noisy and using all the data
points is very computationally expensive , therefore the solution is ~
Using mini-batches of points ~ the true gradient is estimated by taking the average
over the batch , therefore smoother convergence , allows greater learning rates , and
enables parallelism , therefore faster training.
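A minimal sketch in Python, assuming an illustrative grad(X, y, W) helper that returns the average gradient over a batch (all names here are assumptions, not the course's code):

import numpy as np

def minibatch_sgd(X, y, W, grad, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD: average the gradient over each batch of points."""
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)          # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = grad(X[batch], y[batch], W)     # average gradient over the batch
            W = W - lr * g                      # gradient descent step
    return W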
If doing word prediction , then using Bag-of-Words the order is not preserved , and
encoding the sentence with fixed positions instead requires separate parameters for each position.
RNNs have a loop in them that enables them to maintain an internal state , unlike a
vanilla NN where only the final output is present.
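A minimal sketch of one recurrence step in Python (tanh cell, illustrative weight names):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: the hidden state h carries the internal state forward."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Unroll over a sequence: the loop is what maintains the internal state."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states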
Backpropagation through time :
Multiplying many small numbers together -> errors from further-back
time steps keep having smaller and smaller gradients -> this biases parameters to
capture short-term dependencies , thus standard RNNs become less capable of
modelling long-term dependencies.
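Roughly, the gradient to an early state is a product of many Jacobians, and if each factor has norm below 1 the product shrinks towards zero (standard analysis, not the lecture's exact notation):

\frac{\partial L_t}{\partial h_0} = \frac{\partial L_t}{\partial h_t} \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}}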
1.
2.
3.
4.
RNN Applications :
Example :
Convolution preserves the spatial features : multiply the features of the filter
element-wise with the same-size patch it's being compared to and sum up -> then if the
whole thing comes out as 1 -> it's an exact match.
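A minimal sketch of that comparison in Python, assuming +/-1-valued features as in the classic example (function name is illustrative):

import numpy as np

def patch_match_score(filter_patch, image_patch):
    """Element-wise multiply the filter with a same-size patch, then average.
    With +/-1 entries, a score of 1.0 means an exact match."""
    return np.mean(filter_patch * image_patch)

f = np.array([[1, -1], [-1, 1]])
print(patch_match_score(f, f))   # 1.0 -> exact match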
The class scores can be output as a dense layer representing the
probability of each class.
1. We have to define how many features (filters) to detect.
2. ReLU takes any real number as input and shifts any number less than zero to
zero , and keeps anything greater than zero the same (the minimum of everything is 0) ,
as in the formula below.
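In symbols:

\text{ReLU}(x) = \max(0 , x)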
In DEEP CNNs we can stack layers to bring out the features , low -> mid -> high
PART 2 : Classification
The first half can remain the same , but the end can be altered to fit.
EXAMPLEs :
1. The picture is fed into a convolutional feature extractor , then it's decoded /
upscaled.
2. The class of the features is determined.
3. Upsampling is performed using transpose convolutions (see the sketch after this list).
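A minimal Keras-style sketch of that encoder/decoder shape (layer sizes and num_classes are illustrative assumptions, not the lecture's exact model):

import tensorflow as tf

num_classes = 3  # illustrative number of segmentation classes

# downsample with strided convolutions , then upsample with transpose convolutions
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    # per-pixel class probabilities at the original resolution
    tf.keras.layers.Conv2DTranspose(num_classes, 3, strides=2, padding="same",
                                    activation="softmax"),
])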
Generative Modeling :
How can we model some probability distribution that's similar to the true data
distribution ?
Generative models :
Debiasing
Capability to uncover the underlying features.
Outlier Detection
Latent Variables : they are not directly observable but are the things that actually matter.
Autoencoders :
This loss function doesn't require any labels.
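e.g. a plain reconstruction loss comparing the input x with its reconstruction x̂ , no labels needed:

L(x , \hat{x}) = \lVert x - \hat{x} \rVert^2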
Instead of a deterministic layer there is a stochastic sampling operation , i.e. for each
latent variable we learn a mean mu (μ) and a standard deviation sigma (σ) that
represent its probability distribution.
A regularisation term is used to formulate the total loss : the Gaussian (KL-divergence) term.
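In the usual VAE form (standard notation, q(z|x) is the learnt distribution), the total loss combines reconstruction with that Gaussian regularisation term:

L = \lVert x - \hat{x} \rVert^2 + D_{KL}\big( q(z \mid x) \,\Vert\, \mathcal{N}(0 , I) \big)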
We cannot backpropagate gradients through a sampling layer , due to its
stochastic nature , as backprop via the chain rule requires deterministic operations.
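The standard fix is the reparameterisation trick : move the randomness into a fixed noise input ε , so that z becomes deterministic in μ and σ and gradients can flow:

z = \mu + \sigma \odot \varepsilon , \quad \varepsilon \sim \mathcal{N}(0 , I)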
Slowly tuning a single latent variable (increasing or decreasing it) , then running the
decoder to get the output , shows that the variable carries a semantic meaning.
By perturbing the value of a single latent variable we actually get to know what it
means and represents.
All other variables are held fixed while one is tuned accordingly , e.g. the
orientation of a face.
Disentanglement lets the model learn features that are not correlated to one another.
This forces the discriminator to become as good as possible at differentiating the real
data from the created (fake) data.
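For reference, the standard GAN minimax objective (G = generator, D = discriminator; textbook form, not necessarily the slides' notation):

\min_G \max_D \; \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]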
1.
2.
DEEP REINFORCEMENT LEARNING
Reinforcement Learning :
Data is given in state-action pairs ; the goal is to maximise future rewards over many
time steps.
Agent :
The thing that takes the actions.
Environment :
The place where the agent acts.
Actions :
The move that the agent can take in the environment.
State :
A situation that the agent perceives.
Goal of RL is to maximise the reward i.e the feedback or success/failure of the agent.
Total Reward : the sum of all rewards obtained from time t onwards.
Discounted Rewards :
Multiply the discounting factor with each reward and take the total sum ; the
discounting factor is between 0 and 1 , so its effect compounds over time and rewards
further in the future count less.
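In symbols (standard notation, γ is the discounting factor):

R_t = \sum_{k=0}^{\infty} r_{t+k} \quad \text{(total)} , \qquad R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \quad \text{(discounted)}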
A higher Q value tells us that the action is more desirable in this state.
Two types of DRL algorithms exist :
Value Learning :
The Q function is learnt , then used to determine the policy.
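In the usual notation, Q is the expected return for taking action a in state s, and the policy just picks the action with the highest Q value:

Q(s , a) = \mathbb{E}\big[ R_t \mid s_t = s , a_t = a \big] , \qquad \pi^{*}(s) = \operatorname*{argmax}_{a} Q(s , a)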
It's difficult to estimate the Q function ; choosing the best (s , a) pair is not intuitive.
Maximise the target return -> that will train the agent.
Running until termination is not actually feasible in real life , so photo-realistic
simulators can be used.
EXAMPLEs :
1.
This uses a build-up of user data.
2.
Limitations :
1. Number of hidden layers may be unfeasibly large.
2. The resulting model may not generalize
Deep neural networks are very good at perfectly fitting random functions.
1. Take some data instance , then perform some perturbation on it , and the network
outputs a nonsensical label for the (still recognisable) original object.
2.
GCNs : instead of images , graphs are present ; the weights (the kernel) are moved
across the nodes.
Also PointCloud analysis can be implemented : the data points are distributed in a way
that enables capturing the spatial information.
Probability is not a measure of confidence.
Along with the probability , confidence should also be outputted.
Using different dropout masks gives us many estimates , and then the variances can be
observed ; the lower the spread of the uncertainty , the better the prediction (see the
sketch below).
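A minimal sketch of that Monte Carlo dropout idea, assuming a Keras model that contains dropout layers; training=True keeps dropout active at inference (function name is illustrative):

import numpy as np
import tensorflow as tf

def mc_dropout_predict(model, x, n_samples=20):
    """Run many stochastic forward passes and look at the spread.
    training=True keeps the dropout layers active at inference time."""
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_samples)])
    return preds.mean(axis=0), preds.var(axis=0)  # prediction , uncertainty

# the mean is the prediction ; low variance (small spread) means higher confidence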
Thus we are actually learning the Evidential Distribution ; effectively it says how
much evidence the model has in support of a prediction.
Then =>