2 DNN-CNN-RNN


Algorithmic Intelligence Laboratory

Introduction to Neural Networks:


DNN / CNN / RNN

EE807: Recent Advances in Deep Learning


Lecture 1

Slides made by Hyungwon Choi and Yunhun Jang
KAIST EE

Algorithmic Intelligence Laboratory


What is Machine/Deep Learning?

• Human Learning

• Machine Learning = Build an algorithm from data


• Deep learning is a special type of algorithm in machine learning

[Figures: learning perceptions; learning interactions]

Algorithmic Intelligence Laboratory 2
Definition of Deep Learning

• An algorithm that learns multiple levels of abstraction in data

Lots of data + deep & large networks → multi-layer data representations (feature hierarchy): edges → parts → objects
Algorithmic Intelligence Laboratory 3
Deep Learning = Feature Learning

• Why does deep learning outperform other machine learning (ML) approaches for
vision, speech, and language?

Traditional pipeline: Input → feature extraction (e.g., SIFT) → other ML → Output
Deep learning: Input → deep network (learned features) → Output

Algorithmic Intelligence Laboratory 4


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training: backpropagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 5


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training: backpropagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 6


DNN: Neurons in the Brain

• The human brain is made up of about 100 billion neurons


• Neurons receive electric signals at the dendrites and send them to the axon
• Dendrites can perform complex non-linear computations
• Synapses are not a single weight but a complex non-linear dynamical system

Algorithmic Intelligence Laboratory *source : https://pt.slideshare.net/hammawan/deep-neural-networks 7


DNN: Artificial Neural Networks

• Artificial neural networks


• A simplified version of a biological neural network

[Figure: an artificial neuron — inputs are multiplied by weights, summed together with a bias, and passed through a nonlinear activation function to produce the output (activation) of the neuron; a NumPy sketch follows]
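
A minimal sketch of a single artificial neuron, matching the figure above. The weights, bias, and sigmoid activation below are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # weighted sum of the inputs plus a bias, followed by a nonlinear activation
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, -0.5])   # example inputs
w = np.array([0.8, 0.2])    # hypothetical weights
b = 0.1                     # hypothetical bias
print(neuron(x, w, b))      # the neuron's output / activation
```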

Algorithmic Intelligence Laboratory 8


DNN: The Brain vs. Artificial Neural Networks

• Similarities
• Consists of neurons & connections between neurons
• Learning process = Update of connections
• Massive parallel processing

• Differences
• Computation within neuron vastly simplified
• Discrete time steps
• Typically some form of supervised learning with a massive number of stimuli

Algorithmic Intelligence Laboratory *source : http://mt-class.org/jhu/slides/lecture-nn-intro.pdf 9


DNN: Basics

• Deep neural networks


• Neural network with more than 2 layers
• Can model more complex functions

[Figure: left — a single neuron (inputs, weights, summation with bias, nonlinear activation function); right — a network with inputs, one hidden layer, and outputs]

“2-layer neural net” = “1-hidden-layer neural net”

Algorithmic Intelligence Laboratory 10


DNN: Notation

• Training dataset D = {(x_n, y_n)}, n = 1, …, N
• x_n: input data
• y_n: target data (or label for classification)

• Neural network f_θ parameterized by θ = {W_1, b_1, …, W_L, b_L}

Next, forward propagation


Algorithmic Intelligence Laboratory 11
DNN: Forward Propagation

• Forward propagation: calculate the output of the neural network

  h_0 = x,   h_l = σ(W_l h_{l−1} + b_l) for l = 1, …, L,   ŷ = h_L

where σ is the activation function (e.g., the sigmoid function) and L is the number of layers (a NumPy sketch follows)
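
A minimal NumPy sketch of forward propagation, mirroring the 2-3-1 network used in the example below; the weights here are hypothetical placeholders, not the ones on the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    h = x                            # h_0 = x
    for W, b in params:
        h = sigmoid(W @ h + b)       # h_l = sigma(W_l h_{l-1} + b_l)
    return h                         # y_hat = h_L

x = np.array([1.0, -0.5])            # example input from the slides
rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), np.zeros(3)),   # layer 1: 2 -> 3
          (rng.normal(size=(1, 3)), np.zeros(1))]   # layer 2: 3 -> 1
print(forward(x, params))
```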

Algorithmic Intelligence Laboratory 12


DNN: Forward Propagation (Example)

Algorithmic Intelligence Laboratory 13


DNN: Forward Propagation (Example)

• Input data: x = (1.0, −0.5)

Algorithmic Intelligence Laboratory 14


DNN: Forward Propagation (Example)

• Compute the hidden units: h = σ(W_1 x + b_1) = (0.79, 0.92, 0.16) for the input x = (1.0, −0.5),

where σ is the sigmoid function

Algorithmic Intelligence Laboratory 15


DNN: Forward Propagation (Example)

• Compute the output: ŷ = σ(W_2 h + b_2) = 0.62, with hidden activations h = (0.79, 0.92, 0.16)

Next, training objective


Algorithmic Intelligence Laboratory 16
DNN: Objective

• Objective: find the parameters θ that minimize the error (or empirical risk)

  θ* = argmin_θ (1/N) Σ_n ℓ(f_θ(x_n), y_n)

where ℓ is a loss function, e.g., MSE (mean squared error) or cross-entropy

Next, how to optimize θ?


Algorithmic Intelligence Laboratory 17
DNN: Training

• Gradient descent (GD) updates the parameters iteratively in the direction of the negative gradient:

  θ ← θ − η ∇_θ L(θ)

where θ are the parameters, L is the loss function, and η is the learning rate

• Backpropagation
• First adjust the last-layer weights
• Propagate the error back to the previous layers
• Adjust the previous layers' weights
(a sketch of a single gradient-descent step follows)
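
A minimal sketch of one gradient-descent step on a toy one-dimensional loss; purely illustrative, and the `grad_loss` callable is a hypothetical stand-in for the gradient computed by backpropagation.

```python
import numpy as np

def gd_step(theta, grad_loss, lr=0.1):
    # theta <- theta - lr * dL/dtheta
    return theta - lr * grad_loss(theta)

# Toy example: minimize L(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3)
theta = np.array(0.0)
for _ in range(100):
    theta = gd_step(theta, lambda t: 2 * (t - 3), lr=0.1)
print(theta)   # approaches 3
```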

Next, backpropagation in detail


Algorithmic Intelligence Laboratory 18
DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
• Layer-l intermediate output: h_l = σ(W_l h_{l−1} + b_l), with h_0 = x

Algorithmic Intelligence Laboratory 19


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
• Layer-l intermediate output: h_l = σ(W_l h_{l−1} + b_l), with h_0 = x
• Compute the error E = ℓ(y, ŷ), where ℓ is the MSE loss: E = ½(y − ŷ)²

Algorithmic Intelligence Laboratory 20


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
• Layer-l intermediate output: h_l = σ(W_l h_{l−1} + b_l), with h_0 = x
• Compute the error E = ½(y − ŷ)² (MSE loss)

• Compute the derivative of E with respect to the last-layer weights W_2 via the chain rule

Algorithmic Intelligence Laboratory 21


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
• Layer-l intermediate output: h_l = σ(W_l h_{l−1} + b_l), with h_0 = x
• Compute the error E = ½(y − ŷ)² (MSE loss)

• Compute the derivative of E with respect to the last-layer weights W_2

• Parameter update rule: W_2 ← W_2 − η ∂E/∂W_2, where η is the learning rate

Algorithmic Intelligence Laboratory 23


DNN: Backpropagation

• Consider a training input x with target y

• Forward propagation to compute the output ŷ
• Layer-l intermediate output: h_l = σ(W_l h_{l−1} + b_l), with h_0 = x
• Compute the error E = ½(y − ŷ)² (MSE loss)

• Similarly, we can compute gradients with respect to the earlier-layer weights W_1 by propagating the error backwards through the chain rule

• And update W_1 using the same update rule (a NumPy sketch follows)
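
A minimal NumPy sketch of one backpropagation step for a 2-layer sigmoid MLP with MSE loss, mirroring the 2-3-1 example network; the sizes and initial weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.5):
    # Forward pass
    z1 = W1 @ x + b1;  h1 = sigmoid(z1)          # hidden layer
    z2 = W2 @ h1 + b2; y_hat = sigmoid(z2)       # output layer
    # Backward pass (chain rule), for E = 0.5 * (y - y_hat)^2
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # dE/dz2
    dW2, db2 = np.outer(delta2, h1), delta2
    delta1 = (W2.T @ delta2) * h1 * (1 - h1)     # dE/dz1
    dW1, db1 = np.outer(delta1, x), delta1
    # Gradient-descent update
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

rng = np.random.default_rng(0)
x, y = np.array([1.0, -0.5]), np.array([1.0])
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
W1, b1, W2, b2 = backprop_step(x, y, W1, b1, W2, b2)
```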

Algorithmic Intelligence Laboratory 25


DNN: Backpropagation (Example)

• Compute the error E = ½(y − ŷ)² for the network output ŷ = 0.62 from the forward-propagation example (inputs (1.0, −0.5), hidden activations (0.79, 0.92, 0.16))

• Compute the gradient ∂E/∂W_2

Algorithmic Intelligence Laboratory 26


DNN: Backpropagation (Example)

• Compute the gradient ∂E/∂W_2 via the chain rule (network values as before: inputs (1.0, −0.5), hidden (0.79, 0.92, 0.16), output 0.62)

• Update W_2 with W_2 ← W_2 − η ∂E/∂W_2

Algorithmic Intelligence Laboratory 27


DNN: Backpropagation (Example)

• Compute the gradient ∂E/∂W_1 by propagating the error back through the hidden layer

• Update W_1 with W_1 ← W_1 − η ∂E/∂W_1

• Similarly, we can update the remaining parameters (the biases)

Algorithmic Intelligence Laboratory 28


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training: backpropagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and Pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 29


CNN: Drawbacks of Fully-Connected DNN

• Previous DNNs use fully-connected layers


• Connect all the neurons between the layers

• Drawbacks
• (-) Large number of parameters
• Prone to over-fitting
• Large memory consumption

• (-) Does not enforce any structure, e.g., local information


• In many applications, local features are important, e.g., images, language, etc.

Algorithmic Intelligence Laboratory 30


CNN: Basics

• Weight sharing and local connectivity (convolution)


• Use multiple filters that convolve over the inputs
• (+) Reduce the number of parameters (less over-fitting)
• (+) Learn local features
• (+) Translation invariance

• Pooling (or subsampling)


• Make the representations smaller
• (+) Reduce number of parameters and computation

Algorithmic Intelligence Laboratory *source : http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.121.1794&rep=rep1&type=pdf 30


CNN: Weight Sharing and Translation Invariance

• Weight sharing
• Apply same weights over the different spatial regions
• One can achieve translation invariance (not perfect though)

Algorithmic Intelligence Laboratory *source : https://www.cc.gatech.edu/~san37/post/dlhc-cnn/ 32


CNN: Weight Sharing and Translation Invariance

• Weight sharing
• Apply same weights over the different spatial regions
• One can achieve translation invariance

• Translation invariance
• When the input is translated (shifted) spatially, the output used to recognize the object should not change
• Due to weight sharing, a CNN can produce (approximately) the same output even when the input image is shifted

Algorithmic Intelligence Laboratory *source : https://www.cc.gatech.edu/~san37/post/dlhc-cnn/ 33


CNN: Convolution
• Fully-connected layer
  • 32×32×3 image → stretch to a 3072×1 vector
  • With 10×3072 weights, each of the 10 output activations is the result of taking a dot product between a row of the weights and the input

• Convolution layer
  • Keep the 32×32×3 image; use a 5×5×3 filter (equivalent to 1×75 weights of an FC layer)
  • Convolve the filter with the image, i.e., “slide over the image spatially, computing dot products”

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 34


CNN: Convolution
• Fully-connected layer
  • 32×32×3 image → stretch to a 3072×1 vector; 10×3072 weights give 10 activations, each a dot product between a row of the weights and the input

• Convolution layer
  • 32×32×3 image, 5×5×3 filter (equivalent to 1×75 weights of an FC layer)
  • Each output value is the result of taking a dot product between the filter and a small 5×5×3 chunk of the image (i.e., a 5×5×3 = 75-dimensional dot product + a bias)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 35


CNN: Convolution
• Fully-connected layer
  • 32×32×3 image → stretch to a 3072×1 vector; 10×3072 weights give 10 activations

• Convolution layer
  • 32×32×3 image, 5×5×3 filter
  • Convolve (slide) over all spatial locations → a 28×28×1 activation map

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 36


CNN: Convolution
• Fully-connected layer
  • 32×32×3 image → stretch to a 3072×1 vector; 10×3072 weights give 10 activations

• Convolution layer
  • 32×32×3 image; if there are four 5×5×3 filters, convolving (sliding) over all spatial locations gives 4 separate activation maps, i.e., a 28×28×4 output
  (a naive convolution sketch follows)
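
A minimal NumPy sketch of the convolution described above: sliding a set of 5×5×3 filters over a 32×32×3 image with stride 1 and no padding, producing a 28×28×4 stack of activation maps. Purely illustrative; real frameworks use much faster implementations.

```python
import numpy as np

def conv2d(image, filters, biases, stride=1):
    H, W, C = image.shape                 # e.g., 32 x 32 x 3
    K, kH, kW, _ = filters.shape          # e.g., 4 filters of size 5 x 5 x 3
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for k in range(K):                    # one activation map per filter
        for i in range(out_h):
            for j in range(out_w):
                chunk = image[i*stride:i*stride+kH, j*stride:j*stride+kW, :]
                out[i, j, k] = np.sum(chunk * filters[k]) + biases[k]
    return out

rng = np.random.default_rng(0)
image   = rng.normal(size=(32, 32, 3))
filters = rng.normal(size=(4, 5, 5, 3))
biases  = np.zeros(4)
print(conv2d(image, filters, biases).shape)   # (28, 28, 4)
```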

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 37


CNN: An Example

• Closer look at spatial dimensions

7×7 input (spatially)

Assume a 3×3 filter
applied with stride 1

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 38


CNN: An Example

• Closer look at spatial dimensions

7×7 input (spatially)

Assume a 3×3 filter
applied with stride 1
→ 5×5 output

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 42


CNN: An Example

• Closer look at spatial dimensions

7×7 input (spatially)

Assume a 3×3 filter
applied with stride 2

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 43



CNN: An Example

• Closer look at spatial dimensions

7×7 input (spatially)

Assume a 3×3 filter
applied with stride 2
→ 3×3 output

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 45


CNN: An Example

• Closer look at spatial dimensions

7×7 input (spatially)

Assume a 3×3 filter applied with stride 3?
Doesn’t fit!
Cannot apply a 3×3 filter to a 7×7 input with stride 3

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 46


CNN: An Example

• In practice: it is common to zero-pad the border

• Used to control the output (spatial) size

7×7 input (spatially), zero-padded with a 1-pixel border → 9×9
Assume a 3×3 filter applied with stride 3
→ 3×3 output

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 47


CNN: An Example (Animation)

No padding, stride 1 Padding 1, stride 1

Padding 1, stride 2 (odd)

No padding, stride 2 Padding 1, stride 2

Algorithmic Intelligence Laboratory *source : https://github.com/vdumoulin/conv_arithmetic 48


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2

Output volume size = ?

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 49


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2

Output volume size = ?
• (32 + 2×2 − 5)/1 + 1 = 32 spatially
• ⇒ 32×32×10

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 50


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2

Number of parameters in this layer?

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 51


CNN: An Example

• Input volume: 32×32×3
• 10 5×5 filters with stride 1, pad 2

Number of parameters in this layer?
• Each filter has 5×5×3 + 1 = 76 params (+1 for the bias)
• ⇒ 76×10 = 760 parameters
(a small helper to reproduce this arithmetic follows)
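
A small helper, an illustrative sketch rather than anything from the slides, that reproduces the arithmetic above: the output spatial size and the parameter count of a convolutional layer.

```python
def conv_output_size(in_size, kernel, stride, pad):
    # (input + 2*pad - kernel) / stride + 1
    return (in_size + 2 * pad - kernel) // stride + 1

def conv_num_params(kernel, in_channels, num_filters):
    # each filter: kernel*kernel*in_channels weights + 1 bias
    return (kernel * kernel * in_channels + 1) * num_filters

print(conv_output_size(32, kernel=5, stride=1, pad=2))    # 32
print(conv_num_params(5, in_channels=3, num_filters=10))  # 760
```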

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 52


CNN: Convolution

• A ConvNet is a sequence of convolutional layers, each followed by a non-linearity

32×32×3 image → [Conv, e.g., 4 filters of 5×5×3] + ReLU → 28×28×4
             → [Conv, e.g., 6 filters of 5×5×4] + ReLU → 24×24×6
             → [Conv, e.g., 10 filters of 5×5×6] + ReLU → 20×20×10

• Choices of non-linearity
• Tanh / Sigmoid
• ReLU [Nair et al., 2010]
• Leaky ReLU [Maas et al., 2013]

*reference: http://cs231n.stanford.edu/2017/
Algorithmic Intelligence Laboratory *Image source: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6 53
CNN: Pooling

• Pooling layer
• Makes the representations smaller and more manageable
• Operates over each activation map independently
• Enhance translation invariance (invariance to small transformation)
• Larger receptive fields (see more of input)
• Regularization effect

[Figure: pooling downsamples 224×224×64 → 112×112×64]
Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 54
CNN: Pooling

• Max pooling and average pooling

• e.g., with 2×2 filters and stride 2 (a NumPy sketch follows)

• Other kinds of pooling layers are also used

• e.g., stochastic pooling, ROI pooling
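
A minimal NumPy sketch of 2×2 max pooling with stride 2 over a single activation map; illustrative only.

```python
import numpy as np

def max_pool_2x2(x):
    H, W = x.shape
    # reshape into non-overlapping 2x2 blocks and take the max of each block
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [0, 2, 1, 0],
              [3, 1, 2, 4]], dtype=float)
print(max_pool_2x2(x))
# [[6. 7.]
#  [3. 4.]]
```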

*source:
https://deepsense.ai/region-of-interest-pooling-explained/
http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf
Algorithmic Intelligence Laboratory https://vaaaaaanquish.hatenablog.com/entry/2015/01/26/060622 55
CNN: Visualization

• Visualization of CNN feature representations [Zeiler et al., 2014]


• VGG-16 [Simonyan et al., 2015]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 56


CNN in Computer Vision: Everywhere

Classification and retrieval [Krizhevsky et al., 2012]

Algorithmic Intelligence Laboratory 57


CNN in Computer Vision: Everywhere

Detection [Ren et al., 2015] Segmentation [Farabet et al., 2013]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 58


CNN in Computer Vision: Everywhere
Self-driving cars; human pose estimation [Cao et al., 2017]

Image captioning [Vinyals et al., 2015][Karpathy et al., 2015]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 59


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training: backpropagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and Pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 60


RNN: Basics

• CNNs model spatial information (with translation invariance)

• Recurrent Neural Network (RNN)


• Models temporal information
• The hidden state is a function of the current input and information from the
previous time step

• Temporal information is important in many applications


• Language
• Speech
• Video

Algorithmic Intelligence Laboratory 61


RNN: Basics

• Process a sequence of vectors x_t by applying a

recurrence formula at every time step t:

  h_t = f_θ(h_{t−1}, x_t)

new state h_t, old state h_{t−1}, input vector x_t at time step t

f_θ: a function parameterized by θ, e.g., a DNN or CNN

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 62


RNN: Basics

• Process a sequence of vectors x_t by applying the

recurrence formula h_t = f_θ(h_{t−1}, x_t) at every time step t

• The same function f and the same set of parameters θ

are used at every time step

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 63


RNN: Vanilla RNN

• Simple RNN
• The state consists of a single “hidden” vector h_t
• Vanilla RNN (sometimes called an Elman RNN):
  h_t = tanh(W_hh h_{t−1} + W_xh x_t),   y_t = W_hy h_t
(a NumPy sketch of one step follows)
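
A minimal NumPy sketch of one vanilla RNN step and an unrolled forward pass; the dimensions and weights below are hypothetical.

```python
import numpy as np

def rnn_step(h_prev, x, Whh, Wxh, Why):
    h = np.tanh(Whh @ h_prev + Wxh @ x)   # new hidden state h_t
    y = Why @ h                           # output at this time step
    return h, y

rng = np.random.default_rng(0)
hidden, in_dim, out_dim = 3, 4, 4
Whh = rng.normal(size=(hidden, hidden))
Wxh = rng.normal(size=(hidden, in_dim))
Why = rng.normal(size=(out_dim, hidden))

h = np.zeros(hidden)
for x in rng.normal(size=(5, in_dim)):     # a sequence of 5 input vectors
    h, y = rnn_step(h, x, Whh, Wxh, Why)   # same weights reused at every step
```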

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 64


RNN: Computation Graph

Re-use the same weight matrix at every time step

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 65


RNN: Computation Graph (Many to Many)

e.g., Machine Translation


(Sequence of words → sequence of words)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 66


RNN: Computation Graph (Many to One)

e.g., Sentiment Classification


(Sequence of words → sentiment, e.g., good paper or not?)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 67


RNN: Computation Graph (One to Many)

e.g., Image Captioning


(Image → sequence of words)

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 68


RNN: An Example

• Character-level language model

• Vocabulary: [h, e, l, o]

• Example training sequence: “hello”

Input chars: “h”, “e”, “l”, “l”, one-hot encoded at the input layer (one column per time step):
  [1 0 0 0]
  [0 1 0 0]
  [0 0 1 1]
  [0 0 0 0]
Hidden layer (one column per time step):
  [ 0.3  1.0  0.1 -0.3]
  [-0.1  0.3 -0.5  0.9]
  [ 0.9  0.1 -0.3  0.7]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 69


RNN: An Example

• Character-level language model

• Vocabulary: [h, e, l, o]

• Example training sequence: “hello”

Input chars:  “h”  “e”  “l”  “l”  (one-hot at the input layer, as before)
Target chars: “e”  “l”  “l”  “o”
Hidden layer (one column per time step):
  [ 0.3  1.0  0.1 -0.3]
  [-0.1  0.3 -0.5  0.9]
  [ 0.9  0.1 -0.3  0.7]
Output layer (scores over [h, e, l, o], one column per time step):
  [ 1.0  0.5  0.1  0.2]
  [ 2.2  0.3  0.5 -1.5]
  [-3.0 -1.0  1.9 -0.1]
  [ 4.1  1.2 -1.1  2.2]

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 70


RNN: An Example

• Character-level language model

• Vocabulary: [h, e, l, o]

• At test time, sample one character at a time and feed it back into the model (a sampling sketch follows)

Input chars:  “h”  “e”  “l”  “l”  (hidden and output layers as on the previous slide)
Softmax probabilities (one column per time step) and sampled chars “e”, “l”, “l”, “o”:
  [.03 .25 .11 .11]
  [.13 .20 .17 .02]
  [.00 .05 .68 .08]
  [.84 .50 .03 .79]
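
A minimal sketch of test-time character sampling with the vanilla RNN step defined earlier; the weights are random here, so the generated text is meaningless, and the setup is illustrative rather than the slides' trained model.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
rng = np.random.default_rng(0)
hidden = 3
Whh = rng.normal(size=(hidden, hidden))
Wxh = rng.normal(size=(hidden, len(vocab)))
Why = rng.normal(size=(len(vocab), hidden))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sample(seed_char, length):
    x = np.zeros(len(vocab)); x[vocab.index(seed_char)] = 1.0   # one-hot input
    h, out = np.zeros(hidden), [seed_char]
    for _ in range(length):
        h = np.tanh(Whh @ h + Wxh @ x)
        p = softmax(Why @ h)                    # distribution over the next char
        idx = rng.choice(len(vocab), p=p)       # sample one character
        out.append(vocab[idx])
        x = np.zeros(len(vocab)); x[idx] = 1.0  # feed the sample back in
    return ''.join(out)

print(sample('h', 4))
```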

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 71


RNN: Backpropagation Through Time (BPTT)

• Backpropagation through time (BPTT)


• Forward through entire sequence to compute loss, then backward through
entire sequence to compute gradient

Algorithmic Intelligence Laboratory *reference : http://cs231n.stanford.edu/2017/ 72


Table of Contents

1. Deep Neural Networks (DNN)


• Basics
• Training: backpropagation

2. Convolutional Neural Networks (CNN)


• Basics
• Convolution and Pooling
• Some applications

3. Recurrent Neural Networks (RNN)


• Basics
• Character-level language model (example)

4. Question
• Why is it difficult to train a deep neural network?

Algorithmic Intelligence Laboratory 73


Question

• Why is it difficult to train a deep neural network?

• Can we simply stack many layers and train them all?
• Unfortunately, it does not work well
• Even with an infinite amount of computational resources

• Vanishing gradient problem:

• The magnitude of the gradients shrinks exponentially as we backpropagate through
many layers
• This is because typical activation functions such as sigmoid or tanh are bounded (their derivatives are less than 1)
• This phenomenon is called the vanishing gradient problem

Algorithmic Intelligence Laboratory 74


Vanishing Gradient Problem

• Why do gradients vanish?


• Think of a simplified 3-layer neural network

Algorithmic Intelligence Laboratory 75


Vanishing Gradient Problem

• Why do gradients vanish?


• Think of a simplified 3-layer neural network

• First, let’s update


• Calculate the gradient of the loss with respect to

Algorithmic Intelligence Laboratory 76


Vanishing Gradient Problem

• Why do gradients vanish?


• Think of a simplified 3-layer neural network

• How about the first-layer weights W_1? The gradients of bounded activations are < 1

• Calculating the gradient of the loss with respect to W_1 multiplies many such chain-rule factors

• Repeatedly multiplying values < 1 decreases the magnitude exponentially
(a numeric demonstration follows)
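
A tiny numeric demonstration, not from the slides, of why repeatedly multiplying sigmoid derivatives (which are at most 0.25) makes gradients shrink exponentially with depth.

```python
import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)        # maximum value 0.25, attained at z = 0

grad = 1.0
for layer in range(1, 11):
    grad *= sigmoid_deriv(0.0)  # best case: multiply by 0.25 per layer
    print(f"after {layer:2d} layers: {grad:.2e}")
# after 10 layers the factor is 0.25**10, about 9.5e-07
```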
Algorithmic Intelligence Laboratory 77
Vanishing Gradient Over Time

• This is more problematic in vanilla RNN (with tanh/sigmoid activation)


• When trying to handle long temporal dependency
• Similar to previous example, the gradient vanishes over time

Algorithmic Intelligence Laboratory *source :https://mediatum.ub.tum.de/doc/673554/file.pdf 78


Quiz

• The vanishing gradient problem is critical when training neural networks

• Q: Can we just use an activation function whose gradient is > 1?

Algorithmic Intelligence Laboratory 79


Answer for Quiz

• Not really. It will cause another problem, so-called exploding gradients.

• Consider using an exponential activation function:
• The magnitude of its gradient is always larger than 1 when the input is > 0
• If the outputs of the network are positive, the gradients used to update the weights will explode

• This makes training very unstable

• The weights are updated by very large amounts, eventually resulting in NaN values
• This is a critical problem in training neural networks

Algorithmic Intelligence Laboratory 80


How Can We Overcome the Vanishing Gradient Problem?

• Possible solutions
• Activation functions
• CNN: Residual networks [He et al., 2016]
• RNN: LSTM (Long Short-Term Memory)

LSTM (Long Short-Term Memory)

*source
https://mediatum.ub.tum.de/doc/673554/file.pdf
Algorithmic Intelligence Laboratory https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092 81
Solving Vanishing Gradient: Activation Functions

• Use different activation functions that are not bounded:


• Recent works largely use ReLU or its variants
• No saturation, easy to optimize

*source: https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092
Algorithmic Intelligence Laboratory 82
Solving Vanishing Gradient: Activation Functions

• Several generalizations of ReLU

• Leaky ReLU [Maas et al., 2013]: introduces a non-zero gradient for ‘dying ReLUs’
  f(x) = max(0.01x, x)
• Parametric ReLU (PReLU) [He et al., 2015]: additional learnable parameter 𝑎 on
leaky ReLUs
• Randomized ReLU (RReLU) [Xu et al., 2015]: samples the parameter 𝑎 from a uniform
distribution
  f(x) = max(x/𝑎, x)

• Concatenated ReLU (CReLU) [Shang et al., 2016]

• Motivated by the ‘opposite pairs’ of filters found in CNNs: with plain ReLU the network needs to learn twice the information
• ‘Two-sided’ ReLU: concatenates both sides, f(x) = [max(0, x), max(0, −x)]
(a sketch of these activations follows)
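
A small NumPy sketch of the ReLU family above, written with a multiplicative slope a (equivalently x/a in the slides' notation). The slope 0.01 for leaky ReLU and the uniform range for RReLU are common choices assumed here for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)

def prelu(x, a):
    # same form as leaky ReLU, but the slope a is a learnable parameter
    return np.maximum(a * x, x)

def rrelu(x, rng, low=1/8, high=1/3):
    a = rng.uniform(low, high)          # slope sampled anew each forward pass
    return np.maximum(a * x, x)

def crelu(x):
    # 'two-sided' ReLU: concatenate the activations of x and -x
    return np.concatenate([relu(x), relu(-x)])

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))   # [-0.02  -0.005  0.     1.5  ]
print(crelu(x))        # [0.  0.  0.  1.5  2.  0.5  0.  0. ]
```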

Algorithmic Intelligence Laboratory 83


Solving Vanishing Gradient: Residual Networks

• Residual networks (ResNet [He et al., 2016])


• Feed-forward NN with “shortcut connections”
• Can preserve gradient flow throughout the entire depth of the network
• Possible to train more than 100 layers by simply stacking residual blocks

[Figure: plain network vs. residual network with a shortcut (identity) connection] (a conceptual sketch follows)
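
A conceptual NumPy sketch of a residual block: the output is F(x) + x, so the shortcut passes gradients straight through. Illustrative only; a real residual block uses convolutions, batch normalization, and so on.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # two-layer residual function F(x), followed by the identity shortcut
    f = W2 @ relu(W1 @ x)
    return relu(f + x)          # output = F(x) + x

rng = np.random.default_rng(0)
x = rng.normal(size=8)
W1 = rng.normal(size=(8, 8)) * 0.1
W2 = rng.normal(size=(8, 8)) * 0.1
print(residual_block(x, W1, W2).shape)   # (8,)
```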

Algorithmic Intelligence Laboratory 84


Solving Vanishing Gradient: LSTM and GRU

• LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units)


• A specially designed RNN cell that can remember information for a much longer period

3 main steps (a minimal sketch follows after the figure):
• Forget irrelevant parts of previous state
• Selectively update the cell state based on the
new input
• Selectively decide what part of the cell state to
output as the new hidden state

Preservation of gradient information in LSTM
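
A minimal NumPy sketch of one LSTM step in its standard formulation (assumed here rather than taken from the slides), showing the forget, input, and output gates that implement the three steps above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    z = W @ np.concatenate([h_prev, x]) + b      # all gate pre-activations at once
    H = h_prev.size
    f = sigmoid(z[0:H])          # forget irrelevant parts of the previous state
    i = sigmoid(z[H:2*H])        # decide what to write from the new input
    o = sigmoid(z[2*H:3*H])      # decide what part of the cell state to output
    g = np.tanh(z[3*H:4*H])      # candidate cell values
    c = f * c_prev + i * g       # selectively update the cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c

H, D = 4, 3
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4 * H, H + D)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W, b)
```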


*source :
http://harinisuresh.com/2016/10/09/lstms/
Algorithmic Intelligence Laboratory https://mediatum.ub.tum.de/doc/673554/file.pdf 85
References

• [Nair et al., 2010] "Rectified linear units improve restricted Boltzmann machines." ICML 2010.
link : https://dl.acm.org/citation.cfm?id=3104425
• [Krizhevsky et al., 2012] "Imagenet classification with deep convolutional neural networks." NIPS 2012
link : https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
• [Maas et al., 2013] "Rectifier nonlinearities improve neural network acoustic models." ICML 2013.
link : https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf
• [Farabet et al., 2013] "Learning hierarchical features for scene labeling." IEEE transactions on PAMI 2013
link : https://www.ncbi.nlm.nih.gov/pubmed/23787344
• [Zeiler et al., 2014] "Visualizing and understanding convolutional networks." ECCV 2014.
link : https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
• [Simonyan et al., 2015] "Very deep convolutional networks for large-scale image recognition.” ICLR 2015
link : https://arxiv.org/abs/1409.1556
• [Ren et al., 2015] "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015
link : https://arxiv.org/abs/1506.01497
• [Vinyals et al., 2015] "Show and tell: A neural image caption generator." CVPR 2015.
link : https://arxiv.org/abs/1411.4555
• [Karpathy et al., 2015] "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
link : https://cs.stanford.edu/people/karpathy/cvpr2015.pdf
• [He et al., 2015] "Delving deep into rectifiers: Surpassing human-level performance on imagenet
classification." ICCV 2015.
link : https://arxiv.org/abs/1502.01852

Algorithmic Intelligence Laboratory 86


References

• [Xu et al., 2015] "Empirical evaluation of rectified activations in convolutional network." arXiv preprint, 2015.
link : https://arxiv.org/abs/1505.00853
• [Shang et al., 2016] "Understanding and improving convolutional neural networks via concatenated rectified
linear units." ICML 2016.
link : https://arxiv.org/abs/1603.05201
• [He et al., 2016] "Deep residual learning for image recognition." CVPR 2016
link : https://arxiv.org/abs/1512.03385
• [Cao et al., 2017] "Realtime multi-person 2D pose estimation using part affinity fields." CVPR 2017
link : https://arxiv.org/abs/1611.08050
• [Fei-Fei and Karpathy, 2017] “CS231n: Convolutional Neural Networks for Visual Recognition”, 2017. (Stanford
University)
link : http://cs231n.stanford.edu/2017/

Algorithmic Intelligence Laboratory 87
