2 DNN-CNN-RNN
Algorithmic Intelligence Laboratory
Slide made by
Hyungwon Choi and Yunhun Jang
KAIST EE
• Human Learning
• Learning perceptions from lots of data
[Figure: multi-layer data representations (feature hierarchy): edges, parts, objects]
Deep Learning = Feature Learning
• Why does deep learning outperform other machine learning (ML) approaches for
vision, speech, and language?
[Figure label: SIFT]
4. Question
• Why is it difficult to train a deep neural network?
[Figure: artificial neuron with bias and nonlinear activation function]
• Similarities
• Consists of neurons & connections between neurons
• Learning process = Update of connections
• Massive parallel processing
• Differences
• Computation within neuron vastly simplified
• Discrete time steps
• Typically some form of supervised learning with a massive number of stimuli
[Figure: artificial neuron and network: inputs, weights, summation, hidden layer]
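The neuron described by these labels (inputs, weights, summation, bias, nonlinear activation function) can be sketched in a few lines. This is a minimal illustration only: the sigmoid activation and the weight and bias values are assumptions, not taken from the slides.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, one common choice of nonlinear activation."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical weights and bias, chosen only for illustration.
out = neuron(inputs=[1.0, -0.5], weights=[0.4, 0.2], bias=0.1)
print(out)
```

A layer is many such neurons applied to the same inputs, each with its own weights.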
• Training dataset
• x: input data
• y: target data (or label for classification)
• Input data
[Figure: network with inputs 1.0 and -0.5; hidden-unit values 0.79, 0.92, 0.16]
• Compute output
[Figure: same network; output value 0.62]
• Objective: Find parameters that minimize the error (or empirical risk)
• The learning rate sets the step size of each parameter update
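As a rough sketch of this objective, the loop below minimizes an empirical risk by gradient descent. The one-parameter model y = w * x, the squared-error loss, the toy dataset, and the learning rate are all assumptions made for illustration.

```python
def empirical_risk(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def risk_gradient(w, data):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

# Toy dataset generated by y = 3x (an assumption for this example).
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = 0.0
learning_rate = 0.05
for _ in range(200):
    # Step against the gradient; the learning rate scales the step.
    w -= learning_rate * risk_gradient(w, data)
print(w, empirical_risk(w, data))
```

With a learning rate that is too large the same loop diverges, which is why the step size is treated as a tunable hyperparameter.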
• Backpropagation
• First adjust the last layer weights
• Propagate the error back to each previous layer
• Adjust previous layer weights
[Figures: backpropagation steps on the example network (inputs 1.0, -0.5; hidden values 0.79, 0.92, 0.16; output 0.62): compute the error terms, then update the weights, layer by layer from the output back]
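The backpropagation steps can be sketched on a small 2-3-1 network. This is an illustrative sketch under stated assumptions: sigmoid activations, a squared-error loss, no bias terms, and hypothetical initial weights and target (the weights used on the slides are not available).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """Forward pass: input -> hidden (sigmoid) -> single output (sigmoid)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * hj for w, hj in zip(W2, h)))
    return h, y

def backprop_step(x, t, W1, W2, lr):
    """One gradient step on the squared error 0.5 * (y - t)**2."""
    h, y = forward(x, W1, W2)
    # Error term at the output (uses sigmoid'(z) = y * (1 - y)).
    delta_out = (y - t) * y * (1.0 - y)
    # Propagate the error back to each hidden unit using the old W2.
    delta_hid = [delta_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    # Adjust the last-layer weights, then the previous-layer weights.
    new_W2 = [W2[j] - lr * delta_out * h[j] for j in range(len(h))]
    new_W1 = [[W1[j][i] - lr * delta_hid[j] * x[i] for i in range(len(x))]
              for j in range(len(h))]
    return new_W1, new_W2

# Hypothetical initial weights and target, chosen only for illustration.
x, t = [1.0, -0.5], 1.0
W1 = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.4]]
W2 = [0.3, -0.2, 0.1]
loss_before = 0.5 * (forward(x, W1, W2)[1] - t) ** 2
for _ in range(100):
    W1, W2 = backprop_step(x, t, W1, W2, lr=0.5)
loss_after = 0.5 * (forward(x, W1, W2)[1] - t) ** 2
print(loss_before, loss_after)
```

The hidden-layer error terms reuse the output error term, which is what makes the backward pass about as cheap as the forward pass.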
• Drawbacks
• (-) Large number of parameters
• Prone to overfitting
• Large memory consumption
• Weight sharing
• Apply same weights over the different spatial regions
• One can achieve translation invariance (not perfect though)
• Translation invariance
• When the input is shifted spatially (translated), the output used to recognize
the object should not change
• Thanks to weight sharing, a CNN can produce the same output even when the input
image is shifted
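A small 1D example may make the effect of weight sharing concrete. It is a sketch under assumptions: the kernel values and signals are made up, and the global maximum over positions stands in for a pooling step. Because the same kernel is applied at every position, shifting the input shifts the response map by the same amount, and the maximum over positions stays the same.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: the same kernel weights at every position."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1.0, 2.0, 1.0]                     # hypothetical shared weights
signal = [0, 0, 1, 3, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 3, 1, 0, 0]        # same pattern, moved right by 2

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)
print(out)
print(out_shifted)
print(max(out), max(out_shifted))
```

The invariance is not perfect: if the pattern is shifted far enough to leave the signal, part of the response is cut off at the boundary.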
[Figure: fully connected layer: the 32×32×3 image is stretched to 3072×1 and multiplied by 10×3072 weights to give 10×1 outputs]
Convolution layer
• 32×32×3 image, 5×5×3 filter (equivalent to 1×75 weights for FC layer)
• Convolve the filter with the image, i.e., "slide over the image spatially, computing dot products"
• Each output value is the result of taking a dot product between the filter and a small 5×5×3 chunk of the image (i.e., 5×5×3 = 75-dimensional dot product + bias)
• One filter gives a 28×28 activation map
• If there are four 5×5×3 filters, there are 4 separate activation maps
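The sliding dot product can be written out directly. This is a deliberately naive sketch (stride 1, no padding, plain nested lists); the all-ones image and filter are placeholder data, not values from the slides.

```python
def conv2d_single_filter(image, filt, bias=0.0):
    """Slide one k×k×C filter over an H×W×C image (stride 1, no padding)."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    k = len(filt)
    out = []
    for r in range(H - k + 1):
        row = []
        for c in range(W - k + 1):
            # Dot product between the filter and one k×k×C chunk, plus bias.
            acc = bias
            for dr in range(k):
                for dc in range(k):
                    for ch in range(C):
                        acc += filt[dr][dc][ch] * image[r + dr][c + dc][ch]
            row.append(acc)
        out.append(row)
    return out

# Placeholder data: an all-ones 32×32×3 image and an all-ones 5×5×3 filter.
image = [[[1.0] * 3 for _ in range(32)] for _ in range(32)]
filt = [[[1.0] * 3 for _ in range(5)] for _ in range(5)]
activation_map = conv2d_single_filter(image, filt, bias=0.5)
print(len(activation_map), len(activation_map[0]))  # 28 28
print(activation_map[0][0])                          # 75.5
```

Running the same function once per filter and stacking the results gives one activation map per filter.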
• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2
• Output volume size = ?
• (32 + 2×2 - 5)/1 + 1 = 32 spatially
• => 32×32×10
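The output-size arithmetic generalizes to (N + 2P - F)/S + 1 for input size N, padding P, filter size F, and stride S. A small helper, with hypothetical function and parameter names, reproduces the numbers above. Raising an error when the filter does not tile the padded input evenly is a choice made for this sketch; some frameworks round down instead.

```python
def conv_output_size(input_size, filter_size, stride, pad):
    """Spatial output size of a convolution: (N + 2P - F) / S + 1."""
    span = input_size + 2 * pad - filter_size
    if span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1

print(conv_output_size(32, 5, stride=1, pad=2))  # 32
print(conv_output_size(32, 5, stride=1, pad=0))  # 28
```

The output depth is not given by this formula; it equals the number of filters, here 10.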
• Same layer (input volume 32×32×3; 10 5×5 filters with stride 1, pad 2): number of parameters?
• Each filter has 5×5×3 + 1 = 76 params (+1 for bias)
• => 76×10 = 760
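The same count as a small function (the name and signature are hypothetical):

```python
def conv_layer_params(filter_h, filter_w, in_channels, num_filters, bias=True):
    """Learnable parameters in a convolution layer."""
    per_filter = filter_h * filter_w * in_channels + (1 if bias else 0)
    return per_filter * num_filters

print(conv_layer_params(5, 5, 3, 10))  # 760
```

The count does not depend on the spatial size of the input, because the same filter weights are reused at every position.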
[Figure: stacked convolution layers on a 32×32×3 image; spatial size 32 → 28 → 24 → 20]
• Other choices of non-linearity
• Tanh/Sigmoid
• ReLU [Nair et al., 2010]
• Leaky ReLU [Maas et al., 2013]
[Figure: plots of ReLU and LeakyReLU]
*reference: http://cs231n.stanford.edu/2017/
*Image source: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
CNN: Pooling
• Pooling layer
• Makes the representations smaller and more manageable
• Operates over each activation map independently
• Enhances translation invariance (invariance to small transformations)
• Enlarges receptive fields (later layers see more of the input)
• Regularization effect
[Figure: pooling downsamples a 224×224×64 volume to 112×112×64]
*reference: http://cs231n.stanford.edu/2017/
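A minimal sketch of pooling on a single activation map follows. It assumes 2×2 max pooling with stride 2, which is consistent with the 224 to 112 halving but is not stated on the slide; the input values are made up.

```python
def max_pool_2x2(activation_map):
    """2×2 max pooling with stride 2 on one H×W activation map (H, W even)."""
    H, W = len(activation_map), len(activation_map[0])
    return [[max(activation_map[r][c], activation_map[r][c + 1],
                 activation_map[r + 1][c], activation_map[r + 1][c + 1])
             for c in range(0, W, 2)]
            for r in range(0, H, 2)]

fmap = [[1, 1, 2, 4],
        [5, 6, 7, 8],
        [3, 2, 1, 0],
        [1, 2, 3, 4]]
print(max_pool_2x2(fmap))  # [[6, 8], [3, 4]]
```

Applied to each of the 64 activation maps independently, this halves both spatial dimensions and leaves the depth unchanged.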
CNN: Pooling
ROI pooling
*source:
https://deepsense.ai/region-of-interest-pooling-explained/
http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf
https://vaaaaaanquish.hatenablog.com/entry/2015/01/26/060622
CNN: Visualization
• Simple RNN
• The state consists of a single “hidden” vector
• Vanilla RNN (sometimes called an Elman RNN)
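One common way to write the vanilla RNN update is h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h). The sketch below follows that form; the tanh nonlinearity, the layer sizes, and the weight values are assumptions for illustration.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) for a vanilla (Elman) RNN."""
    h = []
    for j in range(len(h_prev)):
        z = b_h[j]
        z += sum(W_xh[j][i] * x[i] for i in range(len(x)))
        z += sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(z))
    return h

# Hypothetical sizes and weights: 2 inputs, 2 hidden units.
W_xh = [[0.5, -0.5], [0.3, 0.8]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
h = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h = rnn_step(x, h, W_xh, W_hh, b_h)   # same weights at every time step
print(h)
```

The single hidden vector h is the only state carried from one time step to the next.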
• Character-level language model
• Vocabulary: [h,e,l,o]
• Example training sequence: "hello"
[Figure: RNN unrolled over the input chars "h", "e", "l", "l"]
Input layer (one-hot, one column per input char):
h: 1 0 0 0
e: 0 1 0 0
l: 0 0 1 1
o: 0 0 0 0
Hidden layer (one column per time step):
0.3 1.0 0.1 -0.3
-0.1 0.3 -0.5 0.9
0.9 0.1 -0.3 0.7
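The one-hot input vectors for this example can be built as follows. Treating the next character as the training target is the usual setup for a character-level language model and is an assumption here, since the target layer is not shown above.

```python
vocab = ["h", "e", "l", "o"]

def one_hot(ch, vocab):
    """Vector with a 1 at the character's vocabulary index, 0 elsewhere."""
    return [1 if ch == v else 0 for v in vocab]

text = "hello"
inputs = [one_hot(ch, vocab) for ch in text[:-1]]   # "h", "e", "l", "l"
targets = list(text[1:])                             # "e", "l", "l", "o"
print(inputs)
print(targets)
```

Each inner list corresponds to one column of the input layer, read top to bottom in vocabulary order.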
4. Question
• Why is it difficult to train a deep neural network?
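• Can we simply stack multiple layers and train them all?
• Unfortunately, it does not work well
• Even with an unlimited amount of computational resources
• Backpropagation multiplies per-layer gradients together: factors with magnitude > 1 make the gradient explode, and factors with magnitude < 1 make it vanish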
• Possible solutions
• Activation functions
• CNN: Residual networks [He et al., 2016]
• RNN: LSTM (Long Short-Term Memory)
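A back-of-the-envelope sketch shows why depth alone causes trouble. It assumes every layer contributes the same scalar factor to the chained gradient, which is a simplification: real networks multiply Jacobian matrices, not scalars.

```python
import math

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def chained_gradient(per_layer_factor, depth):
    """Product of identical per-layer gradient factors over `depth` layers."""
    return per_layer_factor ** depth

for depth in (5, 20, 50):
    vanishing = chained_gradient(sigmoid_grad(0.0), depth)   # factor 0.25
    exploding = chained_gradient(1.5, depth)                  # factor > 1
    print(depth, vanishing, exploding)
```

Even in the best case for the sigmoid (factor 0.25), the gradient reaching the first layers shrinks exponentially with depth, which motivates the solutions listed above.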
*source:
https://mediatum.ub.tum.de/doc/673554/file.pdf
https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092
Solving Vanishing Gradient: Activation Functions
*source: https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092
• Leaky ReLU: f(x) = max(0.01x, x)
• PReLU, RReLU: f(x) = max(x/a, x)
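These activation functions are one-liners. The sketch below assumes a > 1 in the max(x/a, x) form, so that negative inputs are scaled down rather than up; in PReLU the value of a is learned, in RReLU it is sampled randomly during training, and here it is simply passed in as an argument.

```python
def relu(x):
    """f(x) = max(0, x): zero gradient for negative inputs."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """f(x) = max(0.01x, x): small fixed slope for negative inputs."""
    return max(slope * x, x)

def parametric_relu(x, a):
    """f(x) = max(x/a, x), assuming a > 1 (learned in PReLU, random in RReLU)."""
    return max(x / a, x)

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), parametric_relu(x, a=4.0))
```

For positive inputs all three return x unchanged with gradient 1, so the gradient is not shrunk on that side.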
LSTM works in 3 main steps:
• Forget irrelevant parts of previous state
• Selectively update the cell state based on the
new input
• Selectively decide what part of the cell state to
output as the new hidden state
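The three steps map onto the forget, input, and output gates of a standard LSTM cell. The sketch below uses scalar inputs and states to keep it short; the parameter layout (a dict of gate name to weights and bias) and all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate names to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # 1) forget irrelevant parts of the state
    i = sigmoid(pre("input"))         # 2) how much of the candidate to write
    g = math.tanh(pre("candidate"))   #    candidate values from the new input
    c = f * c_prev + i * g            #    selectively updated cell state
    o = sigmoid(pre("output"))        # 3) what part of the cell state to output
    h = o * math.tanh(c)              #    new hidden state
    return h, c

# Hypothetical parameters, chosen only for illustration.
params = {"forget": (0.5, 0.1, 1.0), "input": (0.6, 0.2, 0.0),
          "candidate": (0.9, 0.3, 0.0), "output": (0.4, 0.1, 0.0)}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state passes through nearly unchanged, which is what lets gradients survive across many time steps.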
• [Nair et al., 2010] "Rectified linear units improve restricted Boltzmann machines." ICML 2010.
link : https://dl.acm.org/citation.cfm?id=3104425
• [Krizhevsky et al., 2012] "ImageNet classification with deep convolutional neural networks." NIPS 2012.
link : https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
• [Maas et al., 2013] "Rectifier nonlinearities improve neural network acoustic models." ICML 2013.
link : https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf
• [Farabet et al., 2013] "Learning hierarchical features for scene labeling." IEEE Transactions on PAMI 2013.
link : https://www.ncbi.nlm.nih.gov/pubmed/23787344
• [Zeiler et al., 2014] "Visualizing and understanding convolutional networks." ECCV 2014.
link : https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
• [Simonyan et al., 2015] "Very deep convolutional networks for large-scale image recognition." ICLR 2015.
link : https://arxiv.org/abs/1409.1556
• [Ren et al., 2015] "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015.
link : https://arxiv.org/abs/1506.01497
• [Vinyals et al., 2015] "Show and tell: A neural image caption generator." CVPR 2015.
link : https://arxiv.org/abs/1411.4555
• [Karpathy et al., 2015] "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
link : https://cs.stanford.edu/people/karpathy/cvpr2015.pdf
• [He et al., 2015] "Delving deep into rectifiers: Surpassing human-level performance on ImageNet
classification." ICCV 2015.
link : https://arxiv.org/abs/1502.01852
• [Xu et al., 2015] "Empirical evaluation of rectified activations in convolutional network." arXiv preprint, 2015.
link : https://arxiv.org/abs/1505.00853
• [Shang et al., 2016] "Understanding and improving convolutional neural networks via concatenated rectified
linear units." ICML 2016.
link : https://arxiv.org/abs/1603.05201
• [He et al., 2016] "Deep residual learning for image recognition." CVPR 2016.
link : https://arxiv.org/abs/1512.03385
• [Cao et al., 2017] "Realtime multi-person 2D pose estimation using part affinity fields." CVPR 2017.
link : https://arxiv.org/abs/1611.08050
• [Fei-Fei and Karpathy, 2017] "CS231n: Convolutional Neural Networks for Visual Recognition." 2017. (Stanford
University)
link : http://cs231n.stanford.edu/2017/