2 DNN-CNN-RNN


Algorithmic Intelligence Laboratory

Introduction to Neural Networks:


DNN / CNN / RNN

EE807: Recent Advances in Deep Learning


Lecture 1

Slides made by Hyungwon Choi and Yunhun Jang
KAIST EE

Algorithmic Intelligence Laboratory


What is Machine/Deep Learning?

• Human Learning

• Machine Learning = Build an algorithm from data


• Deep learning is a special type of algorithm in machine learning

[Figures: learning perceptions; learning interactions]

Algorithmic Intelligence Laboratory 2
Definition of Deep Learning

• An algorithm that learns multiple levels of abstraction in data

Lots of data + deep & large networks → multi-layer data representations (feature hierarchy): edges → parts → objects
Algorithmic Intelligence Laboratory 3
Deep Learning = Feature Learning

• Why does deep learning outperform other machine learning (ML) approaches for
vision, speech, and language?

Traditional pipeline: Input → feature extraction (e.g., SIFT) → other ML → Output
Deep learning: Input → deep network (learned features) → Output

Algorithmic Intelligence Laboratory 4


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training: backpropagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 5


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training: backpropagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 6


DNN: Neurons in the Brain

• The human brain is made up of about 100 billion neurons


• Neurons receive electric signals at the dendrites and send them to the axon
• Dendrites can perform complex non-linear computations
• Synapses are not a single weight but a complex non-linear dynamical system

Algorithmic Intelligence Laboratory *source : https://pt.slideshare.net/hammawan/deep-neural-networks 7


DNN: Artificial Neural Networks

• Artificial neural networks


• A simplified version of a biological neural network

[Figure: an artificial neuron — inputs are multiplied by weights, summed together with a bias, and passed through a nonlinear activation function to produce the output (activation) of the neuron; a NumPy sketch follows]
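
A minimal sketch of a single artificial neuron, matching the figure above. The weights, bias, and sigmoid activation below are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # weighted sum of the inputs plus a bias, followed by a nonlinear activation
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, -0.5])   # example inputs
w = np.array([0.8, 0.2])    # hypothetical weights
b = 0.1                     # hypothetical bias
print(neuron(x, w, b))      # the neuron's output / activation
```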

Algorithmic Intelligence Laboratory 8


DNN: The Brain vs. Artificial Neural Networks

• Similarities
• Consists of neurons & connections between neurons
• Learning process = Update of connections
• Massive parallel processing

• Differences
• Computation within neuron vastly simplified
• Discrete time steps
• Typically some form of supervised learning with a massive number of stimuli

Algorithmic Intelligence Laboratory *source : http://mt-class.org/jhu/slides/lecture-nn-intro.pdf 9


DNN: Basics

• Deep neural networks


• Neural network with more than 2 layers
• Can model more complex functions

[Figure: left — a single neuron (inputs, weights, summation with bias, nonlinear activation function); right — a network with inputs, one hidden layer, and outputs]

“2-layer neural net” = “1-hidden-layer neural net”

Algorithmic Intelligence Laboratory 10


DNN: Notation

• Training dataset D = {(x_n, y_n)}, n = 1, …, N
• x_n: input data
• y_n: target data (or label for classification)

• Neural network f_θ parameterized by θ = {W_1, b_1, …, W_L, b_L}

Next, forward propagation


Algorithmic Intelligence Laboratory 11
DNN: Forward Propagation

• Forward propagation: calculate the output of the neural network

  h_0 = x,   h_l = σ(W_l h_{l−1} + b_l) for l = 1, …, L,   ŷ = h_L

where σ is the activation function (e.g., the sigmoid function) and L is the number of layers (a NumPy sketch follows)
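
A minimal NumPy sketch of forward propagation, mirroring the 2-3-1 network used in the example below; the weights here are hypothetical placeholders, not the ones on the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    h = x                            # h_0 = x
    for W, b in params:
        h = sigmoid(W @ h + b)       # h_l = sigma(W_l h_{l-1} + b_l)
    return h                         # y_hat = h_L

x = np.array([1.0, -0.5])            # example input from the slides
rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), np.zeros(3)),   # layer 1: 2 -> 3
          (rng.normal(size=(1, 3)), np.zeros(1))]   # layer 2: 3 -> 1
print(forward(x, params))
```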

Algorithmic Intelligence Laboratory 12


DNN: Forward Propagation (Example)

Algorithmic Intelligence Laboratory 13


DNN: Forward Propagation (Example)

• Input data: x = (1.0, −0.5)

Algorithmic Intelligence Laboratory 14


DNN: Forward Propagation (Example)

• Compute the hidden units: h = σ(W_1 x + b_1) = (0.79, 0.92, 0.16) for the input x = (1.0, −0.5),

where σ is the sigmoid function

Algorithmic Intelligence Laboratory 15


DNN: Forward Propagation (Example)

• Compute the output: ŷ = σ(W_2 h + b_2) = 0.62, with hidden activations h = (0.79, 0.92, 0.16)

Next, training objective


Algorithmic Intelligence Laboratory 16
DNN: Objective

• Objective: find the parameters θ that minimize the error (or empirical risk)

  θ* = argmin_θ (1/N) Σ_n ℓ(f_θ(x_n), y_n)

where ℓ is a loss function, e.g., MSE (mean squared error) or cross-entropy

Next, how to optimize θ?


Algorithmic Intelligence Laboratory 17
DNN: Training

• Gradient descent (GD) updates the parameters iteratively in the direction of the negative gradient:

  θ ← θ − η ∇_θ L(θ)

where θ are the parameters, L is the loss function, and η is the learning rate

• Backpropagation
• First adjust the last-layer weights
• Propagate the error back to the previous layers
• Adjust the previous layers' weights
(a sketch of a single gradient-descent step follows)
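
A minimal sketch of one gradient-descent step on a toy one-dimensional loss; purely illustrative, and the `grad_loss` callable is a hypothetical stand-in for the gradient computed by backpropagation.

```python
import numpy as np

def gd_step(theta, grad_loss, lr=0.1):
    # theta <- theta - lr * dL/dtheta
    return theta - lr * grad_loss(theta)

# Toy example: minimize L(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3)
theta = np.array(0.0)
for _ in range(100):
    theta = gd_step(theta, lambda t: 2 * (t - 3), lr=0.1)
print(theta)   # approaches 3
```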

Next, backpropagation in detail


Algorithmic Intelligence Laboratory 18
DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
• Layer-l intermediate output: h_l = σ(W_l h_{l−1} + b_l), with h_0 = x

Algorithmic Intelligence Laboratory 19


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
• Layer-l intermediate output: h_l = σ(W_l h_{l−1} + b_l), with h_0 = x
• Compute the error E = ℓ(y, ŷ), where ℓ is the MSE loss: E = ½(y − ŷ)²

Algorithmic Intelligence Laboratory 20


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
• Layer-l intermediate output: h_l = σ(W_l h_{l−1} + b_l), with h_0 = x
• Compute the error E = ½(y − ŷ)² (MSE loss)

• Compute the derivative of E with respect to the last-layer weights W_2 via the chain rule

Algorithmic Intelligence Laboratory 21


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
• Layer-l intermediate output: h_l = σ(W_l h_{l−1} + b_l), with h_0 = x
• Compute the error E = ½(y − ŷ)² (MSE loss)

• Compute the derivative of E with respect to the last-layer weights W_2

• Parameter update rule: W_2 ← W_2 − η ∂E/∂W_2, where η is the learning rate

Algorithmic Intelligence Laboratory 23


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
• Layer-l intermediate output: h_l = σ(W_l h_{l−1} + b_l), with h_0 = x
• Compute the error E = ½(y − ŷ)² (MSE loss)

• Similarly, we can compute gradients with respect to the earlier-layer weights W_1 by propagating the error backwards through the chain rule

• And update W_1 using the same update rule (a NumPy sketch follows)
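
A minimal NumPy sketch of one backpropagation step for a 2-layer sigmoid MLP with MSE loss, mirroring the 2-3-1 example network; the sizes and initial weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.5):
    # Forward pass
    z1 = W1 @ x + b1;  h1 = sigmoid(z1)          # hidden layer
    z2 = W2 @ h1 + b2; y_hat = sigmoid(z2)       # output layer
    # Backward pass (chain rule), for E = 0.5 * (y - y_hat)^2
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # dE/dz2
    dW2, db2 = np.outer(delta2, h1), delta2
    delta1 = (W2.T @ delta2) * h1 * (1 - h1)     # dE/dz1
    dW1, db1 = np.outer(delta1, x), delta1
    # Gradient-descent update
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

rng = np.random.default_rng(0)
x, y = np.array([1.0, -0.5]), np.array([1.0])
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
W1, b1, W2, b2 = backprop_step(x, y, W1, b1, W2, b2)
```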

Algorithmic Intelligence Laboratory 25


DNN: Backpropagation (Example)

• Compute the error E = ½(y − ŷ)² for the network output ŷ = 0.62 from the forward-propagation example (inputs (1.0, −0.5), hidden activations (0.79, 0.92, 0.16))

• Compute the gradient ∂E/∂W_2

Algorithmic Intelligence Laboratory 26


DNN: Backpropagation (Example)

• Compute the gradient ∂E/∂W_2 via the chain rule (network values as before: inputs (1.0, −0.5), hidden (0.79, 0.92, 0.16), output 0.62)

• Update W_2 with W_2 ← W_2 − η ∂E/∂W_2

Algorithmic Intelligence Laboratory 27


DNN: Backpropagation (Example)

• Compute the gradient ∂E/∂W_1 by propagating the error back through the hidden layer

• Update W_1 with W_1 ← W_1 − η ∂E/∂W_1

• Similarly, we can update the remaining parameters (the biases)

Algorithmic Intelligence Laboratory 28


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training: backpropagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and Pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 29


CNN: Drawbacks of Fully-Connected DNN

• Previous DNNs use fully-connected layers


• Connect all the neurons between the layers

• Drawbacks
• (-) Large number of parameters
• Prone to over-fitting
• Large memory consumption

• (-) Does not enforce any structure, e.g., local information


• In many applications, local features are important, e.g., images, language, etc.

Algorithmic Intelligence Laboratory 30


CNN: Basics

• Weight sharing and local connectivity (convolution)


• Use multiple filters that convolve over the inputs
• (+) Reduce the number of parameters (less over-fitting)
• (+) Learn local features
• (+) Translation invariance

• Pooling (or subsampling)


• Make the representations smaller
• (+) Reduce number of parameters and computation

Algorithmic Intelligence Laboratory *source : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.1794&rep=rep1&type=pdf 30


CNN: Weight Sharing and Translation Invariance

• Weight sharing
• Apply same weights over the different spatial regions
• One can achieve translation invariance (not perfect though)

Algorithmic Intelligence Laboratory *source : https://www.cc.gatech.edu/~san37/post/dlhc-cnn/ 32


CNN: Weight Sharing and Translation Invariance

• Weight sharing
• Apply same weights over the different spatial regions
• One can achieve translation invariance

• Translation invariance
• When the input is translated (shifted) spatially, the output used to recognize the object should not change
• Due to weight sharing, a CNN can produce (approximately) the same output even when the input image is shifted

Algorithmic Intelligence Laboratory *source : https://www.cc.gatech.edu/~san37/post/dlhc-cnn/ 33


CNN: Convolution
• Fully-connected layer
  • 32×32×3 image → stretch to a 3072×1 vector
  • With 10×3072 weights, each of the 10 output activations is the result of taking a dot product between a row of the weights and the input

• Convolution layer
  • Keep the 32×32×3 image; use a 5×5×3 filter (equivalent to 1×75 weights of an FC layer)
  • Convolve the filter with the image, i.e., “slide over the image spatially, computing dot products”

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 34


CNN: Convolution
• Fully-connected layer
  • 32×32×3 image → stretch to a 3072×1 vector; 10×3072 weights give 10 activations, each a dot product between a row of the weights and the input

• Convolution layer
  • 32×32×3 image, 5×5×3 filter (equivalent to 1×75 weights of an FC layer)
  • Each output value is the result of taking a dot product between the filter and a small 5×5×3 chunk of the image (i.e., a 5×5×3 = 75-dimensional dot product + a bias)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 35


CNN: Convolution
• Fully-connected layer
  • 32×32×3 image → stretch to a 3072×1 vector; 10×3072 weights give 10 activations

• Convolution layer
  • 32×32×3 image, 5×5×3 filter
  • Convolve (slide) over all spatial locations → a 28×28×1 activation map

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 36


CNN: Convolution
• Fully-connected layer
  • 32×32×3 image → stretch to a 3072×1 vector; 10×3072 weights give 10 activations

• Convolution layer
  • 32×32×3 image; if there are four 5×5×3 filters, convolving (sliding) over all spatial locations gives 4 separate activation maps, i.e., a 28×28×4 output
  (a naive convolution sketch follows)
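
A minimal NumPy sketch of the convolution described above: sliding a set of 5×5×3 filters over a 32×32×3 image with stride 1 and no padding, producing a 28×28×4 stack of activation maps. Purely illustrative; real frameworks use much faster implementations.

```python
import numpy as np

def conv2d(image, filters, biases, stride=1):
    H, W, C = image.shape                 # e.g., 32 x 32 x 3
    K, kH, kW, _ = filters.shape          # e.g., 4 filters of size 5 x 5 x 3
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for k in range(K):                    # one activation map per filter
        for i in range(out_h):
            for j in range(out_w):
                chunk = image[i*stride:i*stride+kH, j*stride:j*stride+kW, :]
                out[i, j, k] = np.sum(chunk * filters[k]) + biases[k]
    return out

rng = np.random.default_rng(0)
image   = rng.normal(size=(32, 32, 3))
filters = rng.normal(size=(4, 5, 5, 3))
biases  = np.zeros(4)
print(conv2d(image, filters, biases).shape)   # (28, 28, 4)
```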

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 37


CNN: An Example

• Closer look at spatial dimensions

7×7 input (spatially)

Assume a 3×3 filter
applied with stride 1

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 38


CNN: An Example

• Closer look at spatial dimensions

7×7 input (spatially)

Assume a 3×3 filter
applied with stride 1
→ 5×5 output

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 42


CNN: An Example

• Closer look at spatial dimensions

7×7 input (spatially)

Assume a 3×3 filter
applied with stride 2

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 43



CNN: An Example

• Closer look at spatial dimensions

7×7 input (spatially)

Assume a 3×3 filter
applied with stride 2
→ 3×3 output

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 45


CNN: An Example

• Closer look at spatial dimensions

7×7 input (spatially)

Assume a 3×3 filter applied with stride 3?
Doesn’t fit!
Cannot apply a 3×3 filter to a 7×7 input with stride 3

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 46


CNN: An Example

• In practice: it is common to zero-pad the border

• Used to control the output (spatial) size

7×7 input (spatially), zero-padded with a 1-pixel border → 9×9
Assume a 3×3 filter applied with stride 3
→ 3×3 output

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 47


CNN: An Example (Animation)

No padding, stride 1 Padding 1, stride 1

Padding 1, stride 2 (odd)

No padding, stride 2 Padding 1, stride 2

Algorithmic Intelligence Laboratory *source : https://github.com/vdumoulin/conv_arithmetic 48


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2

Output volume size = ?

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 49


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2

Output volume size = ?
• (32 + 2×2 − 5)/1 + 1 = 32 spatially
• ⇒ 32×32×10

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 50


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2

Number of parameters in this layer?

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 51


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2

Number of parameters in this layer?
• Each filter has 5×5×3 + 1 = 76 params (+1 for the bias)
• ⇒ 76×10 = 760 parameters
(a small helper to reproduce this arithmetic follows)
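
A small helper, an illustrative sketch rather than anything from the slides, that reproduces the arithmetic above: the output spatial size and the parameter count of a convolutional layer.

```python
def conv_output_size(in_size, kernel, stride, pad):
    # (input + 2*pad - kernel) / stride + 1
    return (in_size + 2 * pad - kernel) // stride + 1

def conv_num_params(kernel, in_channels, num_filters):
    # each filter: kernel*kernel*in_channels weights + 1 bias
    return (kernel * kernel * in_channels + 1) * num_filters

print(conv_output_size(32, kernel=5, stride=1, pad=2))    # 32
print(conv_num_params(5, in_channels=3, num_filters=10))  # 760
```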

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 52


CNN: Convolution

• A ConvNet is a sequence of convolutional layers, each followed by a non-linearity

32×32×3 image → [Conv, e.g., 4 filters of 5×5×3] + ReLU → 28×28×4
             → [Conv, e.g., 6 filters of 5×5×4] + ReLU → 24×24×6
             → [Conv, e.g., 10 filters of 5×5×6] + ReLU → 20×20×10

• Choices of non-linearity
• Tanh / Sigmoid
• ReLU [Nair et al., 2010]
• Leaky ReLU [Maas et al., 2013]

*reference: http://cs231n.stanford.edu/2017/
Algorithmic Intelligence Laboratory *Image source: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6 53
CNN: Pooling

• Pooling layer
• Makes the representations smaller and more manageable
• Operates over each activation map independently
• Enhance translation invariance (invariance to small transformation)
• Larger receptive fields (see more of input)
• Regularization effect

[Figure: pooling downsamples 224×224×64 → 112×112×64]
Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 54
CNN: Pooling

• Max pooling and average pooling

• e.g., with 2×2 filters and stride 2 (a NumPy sketch follows)

• Other kinds of pooling layers are also used

• e.g., stochastic pooling, ROI pooling
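
A minimal NumPy sketch of 2×2 max pooling with stride 2 over a single activation map; illustrative only.

```python
import numpy as np

def max_pool_2x2(x):
    H, W = x.shape
    # reshape into non-overlapping 2x2 blocks and take the max of each block
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [0, 2, 1, 0],
              [3, 1, 2, 4]], dtype=float)
print(max_pool_2x2(x))
# [[6. 7.]
#  [3. 4.]]
```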

*source:
https://deepsense.ai/region-of-interest-pooling-explained/
http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf
Algorithmic Intelligence Laboratory https://vaaaaaanquish.hatenablog.com/entry/2015/01/26/060622 55
CNN: Visualization

• Visualization of CNN feature representations [Zeiler et al., 2014]


• VGG-16 [Simonyan et al., 2015]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 56


CNN in Computer Vision: Everywhere

Classification and retrieval [Krizhevsky et al., 2012]

Algorithmic Intelligence Laboratory 57


CNN in Computer Vision: Everywhere

Detection [Ren et al., 2015] Segmentation [Farabet et al., 2013]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 58


CNN in Computer Vision: Everywhere
Self-driving cars; human pose estimation [Cao et al., 2017]

Image captioning [Vinyals et al., 2015][Karpathy et al., 2015]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 59


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training: backpropagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and Pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 60


RNN: Basics

• CNNs model spatial information (with translation invariance)

• Recurrent Neural Network (RNN)


• Models temporal information
• The hidden state is a function of the current input and information from the
previous time step

• Temporal information is important in many applications


• Language
• Speech
• Video

Algorithmic Intelligence Laboratory 61


RNN: Basics

• Process a sequence of vectors x_t by applying a

recurrence formula at every time step t:

  h_t = f_θ(h_{t−1}, x_t)

new state h_t, old state h_{t−1}, input vector x_t at time step t

f_θ: a function parameterized by θ, e.g., a DNN or CNN

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 62


RNN: Basics

• Process a sequence of vectors x_t by applying the

recurrence formula h_t = f_θ(h_{t−1}, x_t) at every time step t

• The same function f and the same set of parameters θ

are used at every time step

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 63


RNN: Vanilla RNN

• Simple RNN
• The state consists of a single “hidden” vector h_t
• Vanilla RNN (sometimes called an Elman RNN):
  h_t = tanh(W_hh h_{t−1} + W_xh x_t),   y_t = W_hy h_t
(a NumPy sketch of one step follows)
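
A minimal NumPy sketch of one vanilla RNN step and an unrolled forward pass; the dimensions and weights below are hypothetical.

```python
import numpy as np

def rnn_step(h_prev, x, Whh, Wxh, Why):
    h = np.tanh(Whh @ h_prev + Wxh @ x)   # new hidden state h_t
    y = Why @ h                           # output at this time step
    return h, y

rng = np.random.default_rng(0)
hidden, in_dim, out_dim = 3, 4, 4
Whh = rng.normal(size=(hidden, hidden))
Wxh = rng.normal(size=(hidden, in_dim))
Why = rng.normal(size=(out_dim, hidden))

h = np.zeros(hidden)
for x in rng.normal(size=(5, in_dim)):     # a sequence of 5 input vectors
    h, y = rnn_step(h, x, Whh, Wxh, Why)   # same weights reused at every step
```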

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 64


RNN: Computation Graph

Re-use the same weight matrix at every time step

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 65


RNN: Computation Graph (Many to Many)

e.g., Machine Translation


(Sequence of words → sequence of words)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 66


RNN: Computation Graph (Many to One)

e.g., Sentiment Classification


(Sequence of words → sentiment, e.g., good paper or not?)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 67


RNN: Computation Graph (One to Many)

e.g., Image Captioning


(Image → sequence of words)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 68


RNN: An Example

• Character-level language model

• Vocabulary: [h, e, l, o]

• Example training sequence: “hello”

Input chars: “h”, “e”, “l”, “l”, one-hot encoded at the input layer (one column per time step):
  [1 0 0 0]
  [0 1 0 0]
  [0 0 1 1]
  [0 0 0 0]
Hidden layer (one column per time step):
  [ 0.3  1.0  0.1 -0.3]
  [-0.1  0.3 -0.5  0.9]
  [ 0.9  0.1 -0.3  0.7]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 69


RNN: An Example

• Character-level language model

• Vocabulary: [h, e, l, o]

• Example training sequence: “hello”

Input chars:  “h”  “e”  “l”  “l”  (one-hot at the input layer, as before)
Target chars: “e”  “l”  “l”  “o”
Hidden layer (one column per time step):
  [ 0.3  1.0  0.1 -0.3]
  [-0.1  0.3 -0.5  0.9]
  [ 0.9  0.1 -0.3  0.7]
Output layer (scores over [h, e, l, o], one column per time step):
  [ 1.0  0.5  0.1  0.2]
  [ 2.2  0.3  0.5 -1.5]
  [-3.0 -1.0  1.9 -0.1]
  [ 4.1  1.2 -1.1  2.2]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 70


RNN: An Example

• Character-level language model

• Vocabulary: [h, e, l, o]

• At test time, sample one character at a time and feed it back into the model (a sampling sketch follows)

Input chars:  “h”  “e”  “l”  “l”  (hidden and output layers as on the previous slide)
Softmax probabilities (one column per time step) and sampled chars “e”, “l”, “l”, “o”:
  [.03 .25 .11 .11]
  [.13 .20 .17 .02]
  [.00 .05 .68 .08]
  [.84 .50 .03 .79]
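
A minimal sketch of test-time character sampling with the vanilla RNN step defined earlier; the weights are random here, so the generated text is meaningless, and the setup is illustrative rather than the slides' trained model.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
rng = np.random.default_rng(0)
hidden = 3
Whh = rng.normal(size=(hidden, hidden))
Wxh = rng.normal(size=(hidden, len(vocab)))
Why = rng.normal(size=(len(vocab), hidden))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sample(seed_char, length):
    x = np.zeros(len(vocab)); x[vocab.index(seed_char)] = 1.0   # one-hot input
    h, out = np.zeros(hidden), [seed_char]
    for _ in range(length):
        h = np.tanh(Whh @ h + Wxh @ x)
        p = softmax(Why @ h)                    # distribution over the next char
        idx = rng.choice(len(vocab), p=p)       # sample one character
        out.append(vocab[idx])
        x = np.zeros(len(vocab)); x[idx] = 1.0  # feed the sample back in
    return ''.join(out)

print(sample('h', 4))
```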

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 71


RNN: Backpropagation Through Time (BPTT)

• Backpropagation through time (BPTT)


• Forward through entire sequence to compute loss, then backward through
entire sequence to compute gradient

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 72


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training: backpropagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and Pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 73


Question

• Why is it difficult to train a deep neural network?

• Can we simply stack many layers and train them all?
• Unfortunately, it does not work well
• Even with an infinite amount of computational resources

• Vanishing gradient problem:

• The magnitude of the gradients shrinks exponentially as we backpropagate through
many layers
• This is because typical activation functions such as sigmoid or tanh are bounded (their derivatives are less than 1)
• This phenomenon is called the vanishing gradient problem

Algorithmic Intelligence Laboratory 74


Vanishing Gradient Problem

• Why do gradients vanish?


• Think of a simplified 3-layer neural network

Algorithmic Intelligence Laboratory 75


Vanishing Gradient Problem

• Why do gradients vanish?


• Think of a simplified 3-layer neural network

• First, let’s update


• Calculate the gradient of the loss with respect to

Algorithmic Intelligence Laboratory 76


Vanishing Gradient Problem

• Why do gradients vanish?


• Think of a simplified 3-layer neural network

• How about the first-layer weights W_1? The gradients of bounded activations are < 1

• Calculating the gradient of the loss with respect to W_1 multiplies many such chain-rule factors

• Repeatedly multiplying values < 1 decreases the magnitude exponentially
(a numeric demonstration follows)
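
A tiny numeric demonstration, not from the slides, of why repeatedly multiplying sigmoid derivatives (which are at most 0.25) makes gradients shrink exponentially with depth.

```python
import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)        # maximum value 0.25, attained at z = 0

grad = 1.0
for layer in range(1, 11):
    grad *= sigmoid_deriv(0.0)  # best case: multiply by 0.25 per layer
    print(f"after {layer:2d} layers: {grad:.2e}")
# after 10 layers the factor is 0.25**10, about 9.5e-07
```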
Algorithmic Intelligence Laboratory 77
Vanishing Gradient Over Time

• This is more problematic in vanilla RNN (with tanh/sigmoid activation)


• When trying to handle long temporal dependency
• Similar to previous example, the gradient vanishes over time

Algorithmic Intelligence Laboratory *source :https://mediatum.ub.tum.de/doc/673554/file.pdf 78


Quiz

• The vanishing gradient problem is critical when training neural networks

• Q: Can we just use an activation function whose gradient is > 1?

Algorithmic Intelligence Laboratory 79


Answer for Quiz

• Not really. It will cause another problem, so-called exploding gradients.

• Consider using an exponential activation function:
• The magnitude of its gradient is always larger than 1 when the input is > 0
• If the outputs of the network are positive, the gradients used to update the weights will explode

• This makes training very unstable

• The weights are updated by very large amounts, eventually resulting in NaN values
• This is a critical problem in training neural networks

Algorithmic Intelligence Laboratory 80


How Can We Overcome the Vanishing Gradient Problem?

• Possible solutions
• Activation functions
• CNN: Residual networks [He et al., 2016]
• RNN: LSTM (Long Short-Term Memory)

LSTM (Long Short-Term Memory)

*source
https://mediatum.ub.tum.de/doc/673554/file.pdf
Algorithmic Intelligence Laboratory https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092 81
Solving Vanishing Gradient: Activation Functions

• Use different activation functions that are not bounded:


• Recent works largely use ReLU or its variants
• No saturation, easy to optimize

*source: https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092
Algorithmic Intelligence Laboratory 82
Solving Vanishing Gradient: Activation Functions

• Several generalizations of ReLU

• Leaky ReLU [Maas et al., 2013]: introduces a non-zero gradient for ‘dying ReLUs’
  f(x) = max(0.01x, x)
• Parametric ReLU (PReLU) [He et al., 2015]: additional learnable parameter 𝑎 on
leaky ReLUs
• Randomized ReLU (RReLU) [Xu et al., 2015]: samples the parameter 𝑎 from a uniform
distribution
  f(x) = max(x/𝑎, x)

• Concatenated ReLU (CReLU) [Shang et al., 2016]

• Motivated by the ‘opposite pairs’ of filters found in CNNs: with plain ReLU the network needs to learn twice the information
• ‘Two-sided’ ReLU: concatenates both sides, f(x) = [max(0, x), max(0, −x)]
(a sketch of these activations follows)
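
A small NumPy sketch of the ReLU family above, written with a multiplicative slope a (equivalently x/a in the slides' notation). The slope 0.01 for leaky ReLU and the uniform range for RReLU are common choices assumed here for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def prelu(x, a):
    # same form as leaky ReLU, but the slope a is a learnable parameter
    return np.maximum(a * x, x)

def rrelu(x, rng, low=1/8, high=1/3):
    a = rng.uniform(low, high)          # slope sampled anew each forward pass
    return np.maximum(a * x, x)

def crelu(x):
    # 'two-sided' ReLU: concatenate the activations of x and -x
    return np.concatenate([relu(x), relu(-x)])

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))   # [-0.02  -0.005  0.     1.5  ]
print(crelu(x))        # [0.  0.  0.  1.5  2.  0.5  0.  0. ]
```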

Algorithmic Intelligence Laboratory 83


Solving Vanishing Gradient: Residual Networks

• Residual networks (ResNet [He et al., 2016])


• Feed-forward NN with “shortcut connections”
• Can preserve gradient flow throughout the entire depth of the network
• Possible to train more than 100 layers by simply stacking residual blocks

[Figure: plain network vs. residual network with a shortcut (identity) connection] (a conceptual sketch follows)
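
A conceptual NumPy sketch of a residual block: the output is F(x) + x, so the shortcut passes gradients straight through. Illustrative only; a real residual block uses convolutions, batch normalization, and so on.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # two-layer residual function F(x), followed by the identity shortcut
    f = W2 @ relu(W1 @ x)
    return relu(f + x)          # output = F(x) + x

rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1 = rng.normal(size=(8, 8)) * 0.1
W2 = rng.normal(size=(8, 8)) * 0.1
print(residual_block(x, W1, W2).shape)   # (8,)
```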

Algorithmic Intelligence Laboratory 84


Solving Vanishing Gradient: LSTM and GRU

• LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units)


• A specially designed RNN cell that can remember information for a much longer period

3 main steps (a minimal sketch follows after the figure):
• Forget irrelevant parts of previous state
• Selectively update the cell state based on the
new input
• Selectively decide what part of the cell state to
output as the new hidden state

Preservation of gradient information in LSTM
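
A minimal NumPy sketch of one LSTM step in its standard formulation (assumed here rather than taken from the slides), showing the forget, input, and output gates that implement the three steps above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    z = W @ np.concatenate([h_prev, x]) + b      # all gate pre-activations at once
    H = h_prev.size
    f = sigmoid(z[0:H])          # forget irrelevant parts of the previous state
    i = sigmoid(z[H:2*H])        # decide what to write from the new input
    o = sigmoid(z[2*H:3*H])      # decide what part of the cell state to output
    g = np.tanh(z[3*H:4*H])      # candidate cell values
    c = f * c_prev + i * g       # selectively update the cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c

H, D = 4, 3
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4 * H, H + D)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W, b)
```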


*source :
http://harinisuresh.com/2016/10/09/lstms/
Algorithmic Intelligence Laboratory https://mediatum.ub.tum.de/doc/673554/file.pdf 85
References

• [Nair et al., 2010] "Rectified linear units improve restricted Boltzmann machines." ICML 2010.
link : https://dl.acm.org/citation.cfm?id=3104425
• [Krizhevsky et al., 2012] "Imagenet classification with deep convolutional neural networks." NIPS 2012
link : https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
• [Maas et al., 2013] "Rectifier nonlinearities improve neural network acoustic models." ICML 2013.
link : https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf
• [Farabet et al., 2013] "Learning hierarchical features for scene labeling." IEEE transactions on PAMI 2013
link : https://www.ncbi.nlm.nih.gov/pubmed/23787344
• [Zeiler et al., 2014] "Visualizing and understanding convolutional networks." ECCV 2014.
link : https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
• [Simonyan et al., 2015] "Very deep convolutional networks for large-scale image recognition.” ICLR 2015
link : https://arxiv.org/abs/1409.1556
• [Ren et al., 2015] "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015
link : https://arxiv.org/abs/1506.01497
• [Vinyals et al., 2015] "Show and tell: A neural image caption generator." CVPR 2015.
link : https://arxiv.org/abs/1411.4555
• [Karpathy et al., 2015] "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
link : https://cs.stanford.edu/people/karpathy/cvpr2015.pdf
• [He et al., 2015] "Delving deep into rectifiers: Surpassing human-level performance on imagenet
classification." ICCV 2015.
link : https://arxiv.org/abs/1502.01852

Algorithmic Intelligence Laboratory 86


References

• [Xu et al., 2015] "Empirical evaluation of rectified activations in convolutional network." arXiv preprint, 2015.
link : https://arxiv.org/abs/1505.00853
• [Shang et al., 2016] "Understanding and improving convolutional neural networks via concatenated rectified
linear units." ICML 2016.
link : https://arxiv.org/abs/1603.05201
• [He et al., 2016] "Deep residual learning for image recognition." CVPR 2016
link : https://arxiv.org/abs/1512.03385
• [Cao et al., 2017] "Realtime multi-person 2D pose estimation using part affinity fields." CVPR 2017
link : https://arxiv.org/abs/1611.08050
• [Fei-Fei and Karpathy, 2017] “CS231n: Convolutional Neural Networks for Visual Recognition”, 2017. (Stanford
University)
link : http://cs231n.stanford.edu/2017/

Algorithmic Intelligence Laboratory 87
