L10 - Intro to Deep Learning
WaveNet by Google DeepMind: acoustic modeling with near human-level Text-To-Speech performance.
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
Big Data + Representation Learning with Deep Nets
https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
Outline
● Motivation
● A Simple Neural Network
● Ideas in Deep Net Architectures
● Ideas in Deep Net Optimization
● Practicals and Resources
A Simple Neural Network
From CS231N
Neural Network: Forward Pass
(Figure: example inputs such as x(1) = 73.8, x(3) = 78.2 flowing through the network)
From CS231N
Neural Network: Backward Pass
Given N training pairs {(x_n, y_n)}, minimize the loss over the weights w:
L(w) = (1/N) * sum_n loss(f(x_n; w), y_n)
With sigmoid non-linearities the model f is non-linear in w, so this is a non-convex optimization :(
Use gradient descent!
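A minimal gradient-descent sketch for a one-layer sigmoid network with a squared-error loss (illustrative names and shapes, not the slides' code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training pairs (x_n, y_n)
X = np.random.randn(100, 3)
y = np.random.rand(100, 1)

w = np.random.randn(3, 1) * 0.01      # weights to learn
lr = 0.1                              # learning rate

for step in range(1000):
    pred = sigmoid(X @ w)             # forward pass
    # loss = mean((pred - y)**2) / 2; gradient via the chain rule:
    grad = X.T @ ((pred - y) * pred * (1 - pred)) / len(X)
    w -= lr * grad                    # gradient descent step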
http://playground.tensorflow.org/
A simple network stacks these building blocks: Fully Connected → Non-linear Op → Fully Connected
From CS231N
Non-linear Op: Tanh, Sigmoid
From CS231N
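These non-linearities in numpy (a quick illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes inputs to (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes inputs to (-1, 1)

print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # [0.119 0.5 0.881]
print(tanh(np.array([-2.0, 0.0, 2.0])))      # [-0.964 0. 0.964]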
Non-linear Op (pooling)
From CS231N
(Animations: convolution with pad 1, stride 2 and with pad 1, stride 1)
From vdumoulin/conv_arithmetic
Convolution
Output
3x3x2 array
H' = (H - K + 2*pad)/stride + 1 = (7 - 3 + 0)/2 + 1 = 3   (here H = 7, K = 3, pad = 0, stride = 2)
From CS231N
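A small helper that applies the output-size formula above (hypothetical helper name):

def conv_output_size(H, K, stride=1, pad=0):
    # H' = (H - K + 2*pad) / stride + 1
    assert (H - K + 2 * pad) % stride == 0, "kernel does not fit the input evenly"
    return (H - K + 2 * pad) // stride + 1

print(conv_output_size(7, 3, stride=2, pad=0))   # 3, matching the example above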
Pooling
A pooling layer (usually inserted between conv layers) reduces the spatial size of the input, which reduces the number of parameters and helps control overfitting.
Discarding pooling layers has been found to be important in training good generative models,
such as variational autoencoders (VAEs) or generative adversarial networks (GANs).
It seems likely that future architectures will feature very few to no pooling layers.
From CS231N
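A minimal 2x2 max-pooling sketch in numpy (illustrative, not the slides' code):

import numpy as np

def max_pool_2x2(x):
    # x: (H, W) feature map with even H and W; returns an (H/2, W/2) map
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)
print(max_pool_2x2(fmap))   # each output value is the max of one 2x2 block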
LeNet (1998 by LeCun et al.): built from Convolution, Pooling, Non-linear Op, and Fully Connected blocks.
AlexNet (2012 by Krizhevsky et al.)
What's different? "Our network takes between five and six days to train on two GTX 580 3GB GPUs." -- Alex
(Figure: 56x56x128 feature maps; a 5x5 convolution with 256x5x5x128 weights, plus a 1x1 conv with 256x256 weights.)
Tip on ConvNets: usually, most computation is spent on convolutions, while most space (parameters) is spent on fully connected layers.
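A rough back-of-the-envelope check of this tip, using AlexNet-like layer shapes (the exact shapes here are assumptions for illustration):

# First conv layer: 96 kernels of 11x11x3 over a 55x55 output grid
conv_params = 11 * 11 * 3 * 96                  # ~35K weights
conv_macs   = 55 * 55 * 96 * (11 * 11 * 3)      # ~105M multiply-accumulates
# First fully connected layer: 6x6x256 inputs to 4096 outputs
fc_params   = 6 * 6 * 256 * 4096                # ~38M weights, but only ~38M MACs
print(conv_params, conv_macs, fc_params)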
An Inception Module: a new building block.
ResNet (2016 by Kaiming He et al.): the winner of ILSVRC 2015. Its key building block is the residual (skip) connection: the block's layers learn F(x) and the output is F(x) + x.
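A minimal residual-block sketch in numpy (purely illustrative: real ResNet blocks use convolutions, batch norm, and ReLU):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(x, W1, W2):
    f = relu(x @ W1) @ W2      # F(x): two layers standing in for the conv stack
    return relu(f + x)         # the skip connection adds the input back in

x = np.random.randn(4, 64)
W1 = np.random.randn(64, 64) * 0.1
W2 = np.random.randn(64, 64) * 0.1
out = residual_block(x, W1, W2)    # same shape as x: (4, 64)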
Segmentation: classify every pixel (e.g., which pixels belong to the dog).
Upconvolution / convolution transpose / deconvolution
(Figure: a small input map x, a kernel w, and a larger output map y; each input element scales a copy of the kernel w, and overlapping contributions are summed into y.)
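A naive transposed-convolution sketch in numpy (an illustration of the idea, not the slides' exact figure):

import numpy as np

def conv_transpose2d(x, w, stride=2):
    # Each input element scales a copy of the kernel, which is added into the
    # larger output map at stride spacing; overlaps are summed.
    H, W = x.shape
    K = w.shape[0]
    out = np.zeros((stride * (H - 1) + K, stride * (W - 1) + K))
    for i in range(H):
        for j in range(W):
            out[i * stride:i * stride + K, j * stride:j * stride + K] += x[i, j] * w
    return out

x = np.arange(4, dtype=float).reshape(2, 2)
w = np.ones((3, 3))
print(conv_transpose2d(x, w, stride=2).shape)   # (5, 5): upsampled output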
Common building blocks for segmentation nets: upsampling + upconv layers, skip links from earlier layers, and dilated convolutions.
Let x be the weights/parameters and dx be the gradient of x. With mini-batches, dx is the average gradient over a batch.
SGD (the vanilla update)
From CS231N
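The vanilla update and its momentum variant in CS231N-style pseudocode (learning_rate and mu are hyperparameters; v is a velocity initialized to zero):

# Vanilla SGD: step straight down the mini-batch gradient
x += -learning_rate * dx

# SGD with momentum: accumulate a velocity (mu is typically ~0.9)
v = mu * v - learning_rate * dx
x += v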
SGD, Momentum, RMSProp, Adagrad, Adam
Per-parameter adaptive learning rate methods
Adagrad by Duchi et al.: divide each parameter's learning rate by the square root of its accumulated squared gradients.
RMSProp by Hinton: use a moving average of squared gradients to reduce Adagrad's aggressive, monotonically decreasing learning rate (update rules sketched below).
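CS231N-style sketches of both updates (cache, decay_rate, and eps are the usual names; eps, e.g. 1e-8, avoids division by zero):

# Adagrad: accumulate squared gradients; each parameter's step shrinks over time
cache += dx**2
x += -learning_rate * dx / (np.sqrt(cache) + eps)

# RMSProp: a leaky accumulation (decay_rate ~ 0.9 to 0.999) keeps steps from vanishing
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += -learning_rate * dx / (np.sqrt(cache) + eps)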
Weight initialization: w = np.random.randn(n) / sqrt(n)   # calibrate the variance by fan-in n
For ReLU units (He et al.): w = np.random.randn(n) * sqrt(2/n)
Data “whitening”
...
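A common preprocessing sketch (zero-centering and normalizing each feature; full whitening would additionally decorrelate features via PCA; this code is an assumption about what the slide covers):

import numpy as np

X = np.random.randn(1000, 20)           # example data matrix, N x D
X -= np.mean(X, axis=0)                 # zero-center each feature
X /= (np.std(X, axis=0) + 1e-8)         # scale each feature to unit variance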
Summary
● Why Deep Learning
● A Simple Neural Network
○ Model, Loss and Optimization