Intro to Deep Learning ~ MIT 6.S191
Deep learning ~ learning a hierarchy of features, i.e. learning higher-level
features from underlying lower-level features.
Perceptron :
Activation Function :
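For reference, the standard perceptron forward pass, where g is the activation function (e.g. sigmoid) and w_0 the bias:

\hat{y} = g\left( w_0 + \sum_{i=1}^{m} x_i w_i \right)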
Loss Optimisation : finding the network weights that achieve the lowest loss, typically
through gradient descent.
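In symbols, the standard update, assuming loss J(W) and learning rate η:

W \leftarrow W - \eta \, \frac{\partial J(W)}{\partial W}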
Backpropagation : chain rule ~
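e.g. for a weight w_1 feeding a hidden unit z_1, the standard decomposition is:

\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}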
Using a single point for gradient descent is very noisy and using all the data
points is very computationally expensive , therefore the solution is ~
Using mini-batches of points ~ the true gradient is estimated by taking the average
over the batch , therefore smoother convergence , allows greater learning rates , and
enables parallelism , therefore faster training.
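A minimal sketch in Python, assuming an illustrative grad(X, y, W) helper that returns the average gradient over a batch (all names here are assumptions, not the course's code):

import numpy as np

def minibatch_sgd(X, y, W, grad, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD: average the gradient over each batch of points."""
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)          # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = grad(X[batch], y[batch], W)     # average gradient over the batch
            W = W - lr * g                      # gradient descent step
    return W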
If doing word prediction , then using Bag-of-Words the order is not preserved , and
encoding the sentence with fixed positions instead requires separate parameters for each position.
RNNs have a loop in them that enables them to maintain an internal state , unlike a
vanilla NN where only the final output is present.
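A minimal sketch of one recurrence step in Python (tanh cell, illustrative weight names):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: the hidden state h carries the internal state forward."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Unroll over a sequence: the loop is what maintains the internal state."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states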
Backpropagation through time :
Multiplying many small numbers together -> errors from further-back
time steps keep having smaller and smaller gradients -> this biases parameters to
capture short-term dependencies , thus standard RNNs become less capable of
modelling long-term dependencies.
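Roughly, the gradient to an early state is a product of many Jacobians, and if each factor has norm below 1 the product shrinks towards zero (standard analysis, not the lecture's exact notation):

\frac{\partial L_t}{\partial h_0} = \frac{\partial L_t}{\partial h_t} \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}}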
1.
2.
3.
4.
RNN Applications :
Example :
Convolution preserves the spatial features : multiply the features of the filter
element-wise with the same-size patch it's being compared to and sum up -> then if the
whole thing comes out as 1 -> it's an exact match.
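A minimal sketch of that comparison in Python, assuming +/-1-valued features as in the classic example (function name is illustrative):

import numpy as np

def patch_match_score(filter_patch, image_patch):
    """Element-wise multiply the filter with a same-size patch, then average.
    With +/-1 entries, a score of 1.0 means an exact match."""
    return np.mean(filter_patch * image_patch)

f = np.array([[1, -1], [-1, 1]])
print(patch_match_score(f, f))   # 1.0 -> exact match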
The class scores can be output as a dense layer representing the
probability of each class.
1. We have to define how many features (filters) to detect.
2. ReLU takes any real number as input and shifts any number less than zero to
zero , and keeps anything greater than zero the same (the minimum of everything is 0) ,
as in the formula below.
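In symbols:

\text{ReLU}(x) = \max(0 , x)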
In DEEP CNNs we can stack layers to bring out the features , low -> mid -> high
PART 2 : Classification
The first half can remain the same , but the end can be altered to fit.
EXAMPLEs :
1. The picture is fed into a convolutional feature extractor , then it's decoded /
upscaled.
2. The class of the features is determined.
3. Upsampling is performed using transpose convolutions (see the sketch after this list).
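A minimal Keras-style sketch of that encoder/decoder shape (layer sizes and num_classes are illustrative assumptions, not the lecture's exact model):

import tensorflow as tf

num_classes = 3  # illustrative number of segmentation classes

# downsample with strided convolutions , then upsample with transpose convolutions
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    # per-pixel class probabilities at the original resolution
    tf.keras.layers.Conv2DTranspose(num_classes, 3, strides=2, padding="same",
                                    activation="softmax"),
])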
Generative Modeling :
How can we model some probability distribution that's similar to the true data
distribution ?
Generative models :
Debiasing
Capability to uncover the underlying features.
Outlier Detection
Latent Variables : they are not directly observable but are the things that actually matter.
Autoencoders :
This loss function doesn't require any labels.
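e.g. a plain reconstruction loss comparing the input x with its reconstruction x̂ , no labels needed:

L(x , \hat{x}) = \lVert x - \hat{x} \rVert^2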
Instead of a deterministic layer there is a stochastic sampling operation , i.e. for each
latent variable we learn a mean mu (μ) and a standard deviation sigma (σ) that
represent its probability distribution.
A regularisation term is used to formulate the total loss : the Gaussian (KL-divergence) term.
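In the usual VAE form (standard notation, q(z|x) is the learnt distribution), the total loss combines reconstruction with that Gaussian regularisation term:

L = \lVert x - \hat{x} \rVert^2 + D_{KL}\big( q(z \mid x) \,\Vert\, \mathcal{N}(0 , I) \big)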
We cannot backpropagate gradients through a sampling layer , due to its
stochastic nature , as backprop via the chain rule requires deterministic operations.
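The standard fix is the reparameterisation trick : move the randomness into a fixed noise input ε , so that z becomes deterministic in μ and σ and gradients can flow:

z = \mu + \sigma \odot \varepsilon , \quad \varepsilon \sim \mathcal{N}(0 , I)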
Slowly tuning a single latent variable (increasing or decreasing it) , then running the
decoder to get the output , shows that the variable carries a semantic meaning.
By perturbing the value of a single latent variable we actually get to know what it
means and represents.
All other variables are held fixed while one is tuned accordingly , e.g. the
orientation of a face.
Disentanglement lets the model learn features that are not correlated to one another.
This forces the discriminator to become as good as possible at differentiating the real
data from the created (fake) data.
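For reference, the standard GAN minimax objective (G = generator, D = discriminator; textbook form, not necessarily the slides' notation):

\min_G \max_D \; \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]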
1.
2.
DEEP REINFORCEMENT LEARNING
Reinforcement Learning :
Data is given in state-action pairs ; the goal is to maximise future rewards over many
time steps.
Agent :
The thing that takes the actions.
Environment :
The place where the agent acts.
Actions :
The move that the agent can take in the environment.
State :
A situation that the agent perceives.
Goal of RL is to maximise the reward i.e the feedback or success/failure of the agent.
Total Reward : the sum of all rewards obtained from time t onwards.
Discounted Rewards :
Multiply the discounting factor with each reward and take the total sum ; the
discounting factor is between 0 and 1 , so its effect compounds over time and rewards
further in the future count less.
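In symbols (standard notation, γ is the discounting factor):

R_t = \sum_{k=0}^{\infty} r_{t+k} \quad \text{(total)} , \qquad R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \quad \text{(discounted)}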
A higher Q value tells us that the action is more desirable in this state.
Two types of DRL algorithms exist :
Value Learning :
The Q function is learnt , then used to determine the policy.
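In the usual notation, Q is the expected return for taking action a in state s, and the policy just picks the action with the highest Q value:

Q(s , a) = \mathbb{E}\big[ R_t \mid s_t = s , a_t = a \big] , \qquad \pi^{*}(s) = \operatorname*{argmax}_{a} Q(s , a)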
It's difficult to estimate the Q function ; choosing the best (s , a) pair is not intuitive.
Maximise the target return -> that will train the agent.
Running until termination is not actually feasible in real life , so photo-realistic
simulators can be used.
EXAMPLEs :
1.
This uses a build-up of user data.
2.
Limitations :
1. Number of hidden layers may be unfeasibly large.
2. The resulting model may not generalize
Deep neural networks are very good at perfectly fitting random functions.
1. Take some data instance , then perform some perturbation on it , and the network
outputs a nonsensical label for the (still recognisable) original object.
2.
GCNs : instead of images , graphs are present ; the weights (the kernel) are moved
across the nodes.
Also PointCloud analysis can be implemented : the data points are distributed in a way
that enables capturing the spatial information.
Probability is not a measure of confidence.
Along with the probability , confidence should also be outputted.
Using different dropout masks gives us many estimates , and then the variances can be
observed ; the lower the spread of the uncertainty , the better the prediction (see the
sketch below).
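A minimal sketch of that Monte Carlo dropout idea, assuming a Keras model that contains dropout layers; training=True keeps dropout active at inference (function name is illustrative):

import numpy as np
import tensorflow as tf

def mc_dropout_predict(model, x, n_samples=20):
    """Run many stochastic forward passes and look at the spread.
    training=True keeps the dropout layers active at inference time."""
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_samples)])
    return preds.mean(axis=0), preds.var(axis=0)  # prediction , uncertainty

# the mean is the prediction ; low variance (small spread) means higher confidence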
Thus we are actually learning the Evidential Distribution ; effectively it says how
much evidence the model has in support of a prediction.
Then =>