Deep Learning RNN


Deep Learning

B.Tech (B)
Ms. Punam R. Patil
Vision of the Department:
To provide prominent computer engineering education with socio-moral values.

Mission of the Department:


1. To provide state-of-the-art ICT based teaching-learning process.
2. To groom the students to become professionally sound computer engineers who
meet the growing needs of industry and society.
3. To make the students responsible human beings by inculcating ethical values.
Program Educational Objectives(PEOs)


PEO1: To provide the foundation of lifelong learning skills for advancing their
careers as professionals, entrepreneurs, and leaders.
PEO2: To develop computer professionals who fulfill industry expectations.
PEO3: To foster ethical and social values to be socially responsible human beings.

2
Deep Learning (PECO7031T)

Teaching Scheme:
Lectures: 03 Hrs./week
Credits: 03

Examination Scheme:
Term Test: 15 Marks
Teacher Assessment: 20 Marks
End Sem Exam: 65 Marks
Total Marks: 100 Marks

3
Prerequisite: Artificial Intelligence, Machine Learning.

Course Objectives:
• To understand hyperparameter tuning.
• To explore Deep Learning Techniques with different learning
strategies.
• To design Deep Learning Models for real time applications.

4
Course Outcomes

5
Unit IV: Recurrent Neural Networks
10 Hrs
• Introduction to Sequence Models and RNNs, Recurrent Neural
Network Model, Backpropagation Through Time
• Different Types of RNNs: Unfolded RNNs, Seq2Seq RNNs, Long
Short-Term Memory (LSTM), Bidirectional RNN, Vanishing
Gradients with RNNs, Gated Recurrent Unit (GRU),
• RNN applications.

6
Sequence Models
• Sequence models are machine learning models whose inputs or outputs are
sequences of data.
• Sequential data includes text streams, audio clips, video clips, time-series
data, etc.
• Recurrent Neural Networks (RNNs) are a popular architecture used in sequence
models.

7
Applications of Sequence Models
1. Speech recognition: In speech recognition, an audio clip is given as
an input and then the model has to generate its text transcript.
Here both the input and output are sequences of data.

https://www.codingninjas.com/studio/library/sequence-models 8
2. Sentiment Classification: In sentiment classification, the opinions expressed
in a piece of text are categorized. Here the input is a sequence of words.
As the order of the words changes a sentence's meaning, we use sequence models
to preserve the word order during classification.

https://www.codingninjas.com/studio/library/sequence-models 9
3. Video Activity Recognition: In video activity recognition, the model needs to
identify the activity in a video clip. A video clip is a sequence of video frames;
therefore, in video activity recognition the input is a sequence of data.

https://www.codingninjas.com/studio/library/sequence-models 10
Recurrent Neural Networks (RNNs)
• A Recurrent Neural Network (RNN) is a type of neural network where the output from
the previous step is fed as input to the current step.
• In traditional neural networks, all the inputs and outputs are independent of each
other; but when the task is to predict the next word of a sentence, the previous
words are needed, so the network has to remember them.
• RNNs address this with the help of a hidden state.
• The main and most important feature of an RNN is this hidden state, which remembers
some information about the sequence.
• The state is also referred to as the memory state, since it remembers the previous
inputs to the network.
• An RNN uses the same parameters for each input, because it performs the same task on
all the inputs or hidden states to produce the output.
• This parameter sharing keeps the number of parameters small, unlike other neural networks. 11
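As a minimal sketch (not taken from the slides), one forward pass of a vanilla RNN can be written in a few lines of NumPy; the variable names (W_xh, W_hh, W_hy, b_h, b_y) are illustrative assumptions, and the key point is that the same weights are reused at every time step:

import numpy as np

def rnn_forward(x_seq, h0, W_xh, W_hh, W_hy, b_h, b_y):
    # x_seq: list of input vectors, one per time step; h0: initial hidden state
    h = h0
    outputs = []
    for x_t in x_seq:
        # the same W_xh, W_hh, b_h are reused at every step (parameter sharing)
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # new hidden (memory) state
        y_t = W_hy @ h + b_y                       # output for this step
        outputs.append(y_t)
    return outputs, h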
Recurrent Neural Networks (RNNs)

12
Recurrent Neural Networks (RNNs)
• RNN is a special neural network suited for sequential (or recurrent) data.
• Examples of sequential data include:
1. Sentences (sequences of words).
2. Time series (sequences of stock prices, for instance).
3. Videos (sequences of frames).
• RNNs are mostly used in the field of Natural Language Processing (NLP). Because an
RNN maintains an internal memory, it is very effective for machine learning problems
that involve sequential data. RNNs are also used for time series prediction.
• The main advantage of RNNs over standard neural networks is that weights are shared
across time steps in an RNN, whereas in standard networks the features learned for
each position are not shared. RNNs can remember their previous inputs, but standard
neural networks cannot; an RNN therefore uses historical information in its computation.
13
Recurrent Neural Networks (RNNs)
• In an RNN, the overall loss function is defined in terms of the loss at each time
step.

• Backpropagation is therefore carried out at each point in time ("backpropagation through time").
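A common way to write this (an assumption here, since the slide does not give the formula) is to sum the per-step losses:
L_total = L_1 + L_2 + ... + L_T = sum over t of L_t(y_hat_t, y_t)
and the gradient of L_total is then propagated backward through every time step.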

14
Backpropagation Through Time-RNN

• Backpropagation is a training algorithm that we use for training neural networks.
When training a neural network, we tune the network's weights to minimize the error
with respect to the available target values using the backpropagation algorithm.
Backpropagation is a supervised learning algorithm, since errors are computed
against already given values.
• The backpropagation training algorithm aims to modify the weights of a neural
network so as to minimize the error of the network's outputs compared to the
expected outputs for the corresponding inputs.

15
Backpropagation Through Time-RNN
• The general backpropagation algorithm is as follows:
1. Feed the input data forward through the network to get an output.
2. Compare the predicted output to the expected output and calculate the error.
3. Calculate the derivatives of the error with respect to the network weights.
4. Use these calculated derivatives to adjust the weights so as to minimize the error.
5. Repeat the process until the error is minimized.

• In simple words, backpropagation is an algorithm in which information from the
cost function is passed back through the neural network. The standard
backpropagation training algorithm is ideal for training feed-forward neural
networks on fixed-size input-output pairs; for RNNs it is extended to
backpropagation through time (BPTT), which unrolls the network over its time steps.
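The five steps above can be sketched for an RNN using PyTorch's automatic differentiation, which performs backpropagation through time when loss.backward() is called; the model, data shapes, and hyperparameters below are illustrative assumptions:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)   # simple recurrent layer
head = nn.Linear(16, 1)                                        # maps each hidden state to an output
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(32, 20, 8)   # 32 dummy sequences, 20 time steps, 8 features each
y = torch.randn(32, 20, 1)   # expected output at every time step (dummy targets)

for epoch in range(100):
    optimizer.zero_grad()
    hidden_seq, _ = rnn(x)        # 1. propagate the inputs through the network
    y_pred = head(hidden_seq)
    loss = loss_fn(y_pred, y)     # 2. compare predictions with the expected results
    loss.backward()               # 3. derivatives of the error w.r.t. the weights (BPTT)
    optimizer.step()              # 4. adjust the weights to reduce the error
                                  # 5. the loop repeats until the error is small enough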

16
Long Short-Term Memory (LSTM)
• Long Short-Term Memory (LSTM) is a kind of recurrent neural network. In an
RNN, the output from the previous step is fed as input to the current step.
• LSTM was designed by Hochreiter & Schmidhuber.
• It tackles the long-term dependency problem of RNNs: a plain RNN cannot recall
information stored far back in the sequence, and gives accurate predictions only
from recent information.
• As the gap length increases, a plain RNN's performance degrades, whereas an LSTM
can by default retain information for a long period of time. It is used for
processing, predicting, and classifying on the basis of time-series data.
17
LSTM- Long Short-Term Memory
 LSTM stands for long short-term memory networks, used in the field of Deep
Learning.
 It is a variant of the recurrent neural network (RNN) capable of learning
long-term dependencies, especially in sequence prediction problems.
 LSTM has feedback connections, i.e., it can process entire sequences of data,
not just single data points such as images.
 LSTM is a special kind of RNN, which shows outstanding performance on a
large variety of problems.
 The central role of an LSTM model is held by a memory cell known as a ‘cell
state’ that maintains its state over time.
LSTM
 The cell state is the horizontal line that runs through the top of the below diagram.
 It can be visualized as a conveyor belt through which information just flows, unchanged.
 Information can be added to or removed from the cell state in LSTM and is regulated by gates.
 These gates optionally let the information flow in and out of the cell.
 It contains a point-wise multiplication operation and a sigmoid neural net layer that assist the
mechanism.
 The sigmoid layer gives out numbers between zero and one, where zero means ‘nothing should be
let through,’ and one means ‘everything should be let through.’
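For reference, the gates described above are usually written out as follows (these standard equations are added here; they do not appear on the slide), using the same notation as the GRU equations later in this unit:

f_t = sigmoid(W_f * [h_t-1, x_t] + b_f)      (forget gate: what to remove from the cell state)
i_t = sigmoid(W_i * [h_t-1, x_t] + b_i)      (input gate: what to add to the cell state)
c_t~ = tanh(W_c * [h_t-1, x_t] + b_c)        (candidate values)
c_t = f_t * c_t-1 + i_t * c_t~               (new cell state: the 'conveyor belt')
o_t = sigmoid(W_o * [h_t-1, x_t] + b_o)      (output gate)
h_t = o_t * tanh(c_t)                        (new hidden state / output)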
LSTM
[Figure: LSTM cell diagram, showing the cell state and the output]
Advantages of LSTM

• Long-term dependencies can be captured by LSTM networks. They have a memory
cell that is capable of long-term information storage.
• In traditional RNNs, there is a problem of vanishing and exploding gradients when
models are trained over long sequences. By using a gating mechanism that selectively
recalls or forgets information, LSTM networks deal with this problem.
• LSTM enables the model to capture and remember the important context, even when
there is a significant time gap between relevant events in the sequence. So LSTMs
are used where understanding context is important, e.g. machine translation.

21
Disadvantages of LSTM

• Compared to simpler architectures like feed-forward neural networks, LSTM
networks are computationally more expensive. This can limit their scalability for
large-scale datasets or constrained environments.
• Training LSTM networks can be more time-consuming than training simpler models
due to their computational complexity, and often requires more data and longer
training times to achieve high performance.
• Since a sentence is processed word by word in a sequential manner, it is hard to
parallelize the work of processing sentences.

22
LSTM Applications
1. Sentiment analysis
2. Language modeling
3. Speech recognition
4. Machine translation
5. Video analysis
Applications of LSTM include:
• Long Short-Term Memory (LSTM) is a powerful type of Recurrent
Neural Network (RNN) that has been used in a wide range of
applications. Here are a few famous applications of LSTM:
• Language Modeling: LSTMs have been used for natural language
processing tasks such as language modeling, machine translation, and
text summarization. They can be trained to generate coherent and
grammatically correct sentences by learning the dependencies
between words in a sentence.
• Speech Recognition: LSTMs have been used for speech recognition
tasks such as transcribing speech to text and recognizing spoken
commands. They can be trained to recognize patterns in speech and
match them to the corresponding text.
24
Applications of LSTM include:
• Time Series Forecasting: LSTMs have been used for time series forecasting tasks such
as predicting stock prices, weather, and energy consumption. They can learn patterns
in time series data and use them to make predictions about future events.
• Anomaly Detection: LSTMs have been used for anomaly detection tasks such as
detecting fraud and network intrusion. They can be trained to identify patterns in
data that deviate from the norm and flag them as potential anomalies.
• Recommender Systems: LSTMs have been used for recommendation tasks such as
recommending movies, music, and books. They can learn patterns in user behavior
and use them to make personalized recommendations.
• Video Analysis: LSTMs have been used for video analysis tasks such as object
detection, activity recognition, and action classification. They can be used in
combination with other neural network architectures, such as Convolutional Neural
Networks (CNNs), to analyze video data and extract useful information.
25
Bidirectional RNN
• BRNNs process input sequences in both the forward and backward directions.
This is the main distinction between BRNNs and conventional recurrent neural
networks.
• A BRNN has two distinct recurrent hidden layers, one of which processes the
input sequence forward and the other of which processes it backward. After that,
the results from these hidden layers are collected and input into a prediction-
making final layer. Any recurrent neural network cell, such as Long Short-Term
Memory (LSTM) or Gated Recurrent Unit, can be used to create the recurrent
hidden layers.
• In the forward direction, the BRNN functions similarly to conventional recurrent
neural networks, updating the hidden state based on the current input and the
prior hidden state at each time step. The backward hidden layer, on the other hand,
processes the input sequence in the opposite direction, updating its hidden state
based on the current input and the hidden state of the next time step.

26
Bidirectional RNN
• Compared to conventional unidirectional recurrent neural networks,
the accuracy of the BRNN is improved since it can process information
in both directions and account for both past and future contexts.
Because the two hidden layers can complement one another and give
the final prediction layer more data, using two distinct hidden layers
also offers a type of model regularisation.
• To update the model parameters, gradients are computed for both the forward and
backward passes of backpropagation through time, the technique typically used to
train BRNNs. At inference time, the input sequence is processed by the BRNN in a
single forward pass, and predictions are made based on the combined outputs of
the two hidden layers.
27
Bidirectional RNN

28
Working of Bidirectional Recurrent Neural Network
• Inputting a sequence: A sequence of data points, each represented as a vector of the same
dimensionality, is fed into the BRNN. Sequences may have different lengths.
• Dual processing: The data is processed in both the forward and backward directions. In the
forward direction, the hidden state at time step t is computed from the input at that step and
the hidden state at step t-1. In the backward direction, the hidden state at step t is computed
from the input at step t and the hidden state at step t+1.
• Computing the hidden state: A non-linear activation function on the weighted sum of the input
and previous hidden state is used to calculate the hidden state at each step. This creates a memory
mechanism that enables the network to remember data from earlier steps in the process.
• Determining the output: A non-linear activation function is used to determine the output at each
step from the weighted sum of the hidden state and a number of output weights. This output has
two options: it can be the final output or input for another layer in the network.
• Training: The network is trained through a supervised learning approach where the goal is to
minimize the discrepancy between the predicted output and the actual output. The network
adjusts its weights in the input-to-hidden and hidden-to-output connections during training
through backpropagation. 29
Bidirectional RNN
• To calculate the output from a BRNN unit, we use the following formulas:
H_t(forward) = A(X_t * W_XH(forward) + H_t-1(forward) * W_HH(forward) + b_H(forward))
H_t(backward) = A(X_t * W_XH(backward) + H_t+1(backward) * W_HH(backward) + b_H(backward))
where,
A = activation function,
W = weight matrix,
b = bias.
• The hidden state at time t is given by a combination of H_t(forward) and H_t(backward). The output at any given
hidden state is: Y_t = H_t * W_AY + b_Y
• The training of a BRNN is similar to the backpropagation through time (BPTT) algorithm, which works as follows:
• Roll out the network and calculate the errors at each time step.
• Update the weights and roll up the network.
However, because the forward and backward passes in a BRNN occur simultaneously, updating the weights for the two
processes could occur at the same time, which would produce inaccurate outcomes. Thus, an approach that handles the
forward and backward passes individually is used to train a BRNN.
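A short NumPy sketch of the two hidden-state formulas above, showing the forward pass, the backward pass, and one common way of combining them (concatenation); all function and variable names here are assumptions, since the slide does not fix an implementation:

import numpy as np

def birnn_forward(x_seq, W_xh_f, W_hh_f, b_f, W_xh_b, W_hh_b, b_b):
    T = len(x_seq)
    h_fwd, h_bwd = [None] * T, [None] * T
    h = np.zeros(W_hh_f.shape[0])
    for t in range(T):                        # forward direction: t = 0 .. T-1
        h = np.tanh(W_xh_f @ x_seq[t] + W_hh_f @ h + b_f)
        h_fwd[t] = h
    h = np.zeros(W_hh_b.shape[0])
    for t in reversed(range(T)):              # backward direction: t = T-1 .. 0
        h = np.tanh(W_xh_b @ x_seq[t] + W_hh_b @ h + b_b)
        h_bwd[t] = h
    # combine both directions at each time step before the output layer
    return [np.concatenate([h_fwd[t], h_bwd[t]]) for t in range(T)]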
30
Advantages of Bidirectional RNN
• Context from both past and future: With the ability to process sequential input both
forward and backward, BRNNs provide a thorough grasp of the full context of a sequence.
Because of this, BRNNs are effective at tasks like sentiment analysis and speech
recognition.
• Enhanced accuracy: BRNNs frequently yield more precise answers since they take both
historical and upcoming data into account.
• Efficient handling of variable-length sequences: When compared to conventional RNNs,
which require padding to have a constant length, BRNNs are better equipped to handle
variable-length sequences.
• Resilience to noise and irrelevant information: BRNNs can be resistant to noise and
irrelevant information in the data, because both the forward and backward paths offer
useful information that supports the network's predictions.
• Ability to handle sequential dependencies: BRNNs can capture long-term relationships
between sequence elements, making them adept at handling complicated sequential
dependencies. 31
Disadvantages of Bidirectional RNN
• Computational complexity: Given that they analyze data both forward and
backward, BRNNs can be computationally expensive due to the increased
amount of calculations needed.
• Long training time: BRNNs can also take a while to train because there are
many parameters to optimize, especially when using huge datasets.
• Difficulty in parallelization: Due to the requirement for sequential
processing in both the forward and backward directions, BRNNs can be
challenging to parallelize.
• Overfitting: BRNNs are prone to overfitting, since their many parameters can
result in overly complicated models, especially when trained on small datasets.
• Interpretability: Due to the processing of data in both forward and
backward directions, BRNNs can be tricky to interpret since it can be
difficult to comprehend what the model is doing and how it is producing
predictions.
32
Advantages of Recurrent Neural Network
• Recurrent Neural Networks (RNNs) have several advantages over
other types of neural networks, including:
1. Ability To Handle Variable-Length Sequences
• RNNs are designed to handle input sequences of variable length,
which makes them well-suited for tasks such as speech
recognition, natural language processing, and time series analysis.
2. Memory of Past Inputs
• RNNs have a memory of past inputs, which allows them to capture
information about the context of the input sequence. This makes
them useful for tasks such as language modeling, where the
meaning of a word depends on the context in which it appears.

https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn
33
Gated Recurrent Unit (GRU)

• GRU stands for Gated Recurrent Unit, which is a type of recurrent neural network
(RNN) architecture similar to LSTM (Long Short-Term Memory).
• Like LSTM, GRU is designed to model sequential data by allowing information to be
selectively remembered or forgotten over time. However, GRU has a simpler
architecture than LSTM, with fewer parameters, which can make it easier to train
and more computationally efficient.

Understanding Gated Recurrent Unit (GRU) in Deep Learning | by Anishnama | Medium 34


Gated Recurrent Unit (GRU)
• The main difference between GRU and LSTM is the way they handle the
memory cell state.
In LSTM, the memory cell state is maintained separately from the
hidden state and is updated using three gates: the input gate, output gate, and
forget gate.
In GRU, the memory cell state is replaced with a “candidate activation
vector,” which is updated using two gates: the reset gate and update gate.
• The reset gate determines how much of the previous hidden state to forget,
while the update gate determines how much of the candidate activation
vector to incorporate into the new hidden state.
• Overall, GRU is a popular alternative to LSTM for modeling sequential data,
especially in cases where computational resources are limited or where a
simpler architecture is desired.
Understanding Gated Recurrent Unit (GRU) in Deep Learning | by Anishnama | Medium 35
How GRU Works?
• Like other recurrent neural network architectures, GRU processes sequential
data one element at a time, updating its hidden state based on the current
input and the previous hidden state.
• At each time step, the GRU computes a “candidate activation vector” that
combines information from the input and the previous hidden state.
• This candidate vector is then used to update the hidden state for the next
time step.
• The candidate activation vector is computed using two gates: the reset gate
and the update gate.
• The reset gate determines how much of the previous hidden state to forget,
while the update gate determines how much of the candidate activation
vector to incorporate into the new hidden state.

Understanding Gated Recurrent Unit (GRU) in Deep Learning | by Anishnama | Medium 36


Gated Recurrent Unit (GRU)

37
How GRU Works?
1. The reset gate r and update gate z are computed using the current input x and the previous
hidden state h_t-1
r_t = sigmoid(W_r * [h_t-1, x_t])
z_t = sigmoid(W_z * [h_t-1, x_t])
where W_r and W_z are weight matrices that are learned during training.

2. The candidate activation vector h_t~ is computed using the current input x and a modified
version of the previous hidden state that is "reset" by the reset gate:
h_t~ = tanh(W_h * [r_t * h_t-1, x_t])
where W_h is another weight matrix.

3. The new hidden state h_t is computed by combining the candidate activation vector with the
previous hidden state, weighted by the update gate:
h_t = (1 - z_t) * h_t-1 + z_t * h_t~
Understanding Gated Recurrent Unit (GRU) in Deep Learning | by Anishnama | Medium 38
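The three equations above translate directly into a single step of a GRU cell; the sketch below (in NumPy) follows the slide's equations, with biases omitted as on the slide, and the concatenation order [h_t-1, x_t] assumed:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    hx = np.concatenate([h_prev, x_t])                            # [h_t-1, x_t]
    r_t = sigmoid(W_r @ hx)                                       # reset gate
    z_t = sigmoid(W_z @ hx)                                       # update gate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate h_t~
    h_t = (1 - z_t) * h_prev + z_t * h_cand                       # new hidden state
    return h_t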
How GRU Works?
• Overall, the reset gate determines how much of the previous
hidden state to remember or forget, while the update gate
determines how much of the candidate activation vector to
incorporate into the new hidden state. The result is a compact
architecture that is able to selectively update its hidden state
based on the input and previous hidden state, without the need for
a separate memory cell state like in LSTM.

Understanding Gated Recurrent Unit (GRU) in Deep Learning | by Anishnama | Medium 39


GRU Architecture
The GRU architecture consists of the following components:
• Input layer: The input layer takes in sequential data, such as a sequence of
words or a time series of values, and feeds it into the GRU.
• Hidden layer: The hidden layer is where the recurrent computation occurs.
At each time step, the hidden state is updated based on the current input and
the previous hidden state. The hidden state is a vector of numbers that
represents the network’s “memory” of the previous inputs.
• Reset gate: The reset gate determines how much of the previous hidden
state to forget. It takes as input the previous hidden state and the current
input, and produces a vector of numbers between 0 and 1 that controls the
degree to which the previous hidden state is “reset” at the current time step.

Understanding Gated Recurrent Unit (GRU) in Deep Learning | by Anishnama | Medium 40


GRU Architecture
• Update gate: The update gate determines how much of the candidate
activation vector to incorporate into the new hidden state. It takes as input
the previous hidden state and the current input, and produces a vector of
numbers between 0 and 1 that controls the degree to which the candidate
activation vector is incorporated into the new hidden state.
• Candidate activation vector: The candidate activation vector is a modified
version of the previous hidden state that is “reset” by the reset gate and
combined with the current input. It is computed using a tanh activation
function that squashes its output between -1 and 1.
• Output layer: The output layer takes the final hidden state as input and
produces the network’s output. This could be a single number, a sequence of
numbers, or a probability distribution over classes, depending on the task at
hand.
Understanding Gated Recurrent Unit (GRU) in Deep Learning | by Anishnama | Medium 41
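As a usage-level sketch (assumed, not taken from the slide), the input layer, hidden layer, and output layer described above map directly onto PyTorch's nn.GRU followed by a linear output layer for a classification task:

import torch
import torch.nn as nn

gru = nn.GRU(input_size=100, hidden_size=64, batch_first=True)   # input + hidden layers
classifier = nn.Linear(64, 3)                                    # output layer over 3 classes

x = torch.randn(16, 25, 100)      # 16 dummy sequences, 25 time steps, 100-dim inputs
outputs, h_last = gru(x)          # h_last holds the final hidden state, shape (1, 16, 64)
logits = classifier(h_last[-1])   # class scores from the final hidden state
probs = torch.softmax(logits, dim=-1)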
Advantages of GRU
1. GRU networks are similar to Long Short-Term Memory (LSTM)
networks, but with fewer parameters, making them
computationally less expensive and faster to train.
2. GRU networks can handle long-term dependencies in sequential
data by selectively remembering and forgetting previous inputs.
3. GRU networks have been shown to perform well on a variety of
tasks, including natural language processing, speech recognition,
and music generation.
4. GRU networks can be used for both sequence-to-sequence and
sequence classification tasks.
Understanding Gated Recurrent Unit (GRU) in Deep Learning | by Anishnama | Medium 42
Disadvantages of GRU
1. GRU networks may not perform as well as LSTMs on tasks that
require modeling very long-term dependencies or complex
sequential patterns.
2. GRU networks may be more prone to overfitting than LSTMs,
especially on smaller datasets.
3. GRU networks require careful tuning of hyperparameters, such
as the number of hidden units and learning rate, to achieve good
performance.
4. GRU networks may not be as interpretable as other machine
learning models, since the gating mechanism can make it
difficult to understand how the network is making predictions.
Understanding Gated Recurrent Unit (GRU) in Deep Learning | by Anishnama | Medium 43
Types of Recurrent Neural Networks

• There are four types of Recurrent Neural Networks:


• One to One
• One to Many
• Many to One
• Many to Many

44
One to One RNN

• This type of neural network is known as the Vanilla Neural Network. It is used
for general machine learning problems that have a single input and a single
output.

45
One to Many RNN
• This type of neural network has a single input and multiple outputs. An example
of this is image captioning.

46
Many to One RNN

• This RNN takes a sequence of inputs and generates a single output. Sentiment
analysis is a good example of this kind of network, where a given sentence can
be classified as expressing positive or negative sentiment.

47
Many to Many RNN

• This RNN takes a sequence of inputs and generates a sequence of outputs.
Machine translation is one example.

48
Advantages of Recurrent Neural Network
3. Parameter Sharing
• RNNs share the same set of parameters across all time steps,
which reduces the number of parameters that need to be learned
and can lead to better generalization.
4. Non-Linear Mapping
• RNNs use non-linear activation functions, which allows them to
learn complex, non-linear mappings between inputs and outputs.
5. Sequential Processing
• RNNs process input sequences step by step, which is a natural fit for ordered
data such as text and time series; note, however, that this same property makes
the computation hard to parallelize (see the disadvantages later).
https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn
49
Advantages of Recurrent Neural Network
6. Flexibility
• RNNs can be adapted to a wide range of tasks and input types, including
text, speech, and image sequences.
7. Improved Accuracy
• RNNs have been shown to achieve state-of-the-art performance on a
variety of sequence modeling tasks, including language modeling, speech
recognition, and machine translation.

These advantages make RNNs a powerful tool for sequence modeling and analysis,
and have led to their widespread use in a variety of applications, including
natural language processing, speech recognition, and time series analysis.
https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn
50
Disadvantages of Recurrent Neural Network
• Although Recurrent Neural Networks (RNNs) have several advantages,
they also have some disadvantages. Here are some of the main
disadvantages of RNNs:
1. Vanishing And Exploding Gradients
• RNNs can suffer from the problem of vanishing or exploding gradients,
which can make it difficult to train the network effectively. This occurs
when the gradients of the loss function with respect to the parameters
become very small or very large as they propagate through time.
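A one-line numeric illustration (added here, not on the slide): the gradient flowing back through t steps is scaled by a product of per-step factors, so if each factor is about 0.9 the contribution after 100 steps is roughly 0.9^100 ≈ 0.00003 (vanishing), while a factor of about 1.1 grows to 1.1^100 ≈ 13,800 (exploding).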
2. Computational Complexity
• RNNs can be computationally expensive to train, especially when dealing
with long sequences. This is because the network has to process each
input in sequence, which can be slow.

https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn
51
Disadvantages of Recurrent Neural Network
3. Difficulty In Capturing Long-Term Dependencies
• Although RNNs are designed to capture information about past
inputs, they can struggle to capture long-term dependencies in the
input sequence. This is because the gradients can become very small
as they propagate through time, which can cause the network to
forget important information.
4. Lack Of Parallelism
• RNNs are inherently sequential, which makes it difficult to parallelize
the computation. This can limit the speed and scalability of the
network.
https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn
52
Disadvantages of Recurrent Neural Network
5. Difficulty In Choosing The Right Architecture
• There are many different variants of RNNs, each with its own advantages
and disadvantages. Choosing the right architecture for a given task can be
challenging, and may require extensive experimentation and tuning.
6. Difficulty In Interpreting The Output
• The output of an RNN can be difficult to interpret, especially when
dealing with complex inputs such as natural language or audio. This can
make it difficult to understand how the network is making its
predictions.

These disadvantages are important to consider when deciding whether to use an RNN
for a given task. However, many of these issues can be addressed through careful
design and training of the network and through techniques such as regularization
and attention mechanisms.
https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn
53
