DL Unit-2
UNIT II: Training Neural Network: Risk minimization, loss function, back propagation,
regularization, model selection, and optimization.
Conditional Random Fields: Linear chain, partition function, Markov network, Belief
propagation, Training CRFs, Hidden Markov Model, Entropy.
Get your hands on as large a dataset as possible (DNNs are quite data-hungry: more is
better).
Remove any training sample with corrupted data (short texts, highly distorted images,
spurious output labels, features with lots of null values, etc.).
Data augmentation: create new examples (in the case of images: rescale, add noise, etc.).
we propagate down the model until they eventually explode, and this is what we call the
problem of exploding gradient.
Number of Hidden Units and Layers:
Keeping more hidden units than the optimal number is generally a safe bet, since any
regularization method will take care of superfluous units, at least to some extent. On the
other hand, with fewer hidden units than the optimal number, there is a higher chance of
underfitting the model.
Weight Initialization:
Always initialize the weights with small random numbers to break the symmetry between
different units. If the weights are initialized to very large values, the sigmoid saturates (its tail
regions), where the gradient is nearly zero, so the affected neurons stop learning.
If the weights are very small, the gradients will also be small. It is therefore preferable to
choose weights in an intermediate range, distributed evenly around a mean
value.
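The saturation effect can be checked numerically. The following is a minimal sketch, assuming a single sigmoid unit with ten inputs; the input vector, the weight ranges, and all function names are illustrative assumptions, not part of the original notes.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

def pre_activation(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

random.seed(0)
inputs = [1.0] * 10  # hypothetical input vector

# Small random weights keep the unit in the responsive middle of the sigmoid.
small_w = [random.uniform(-0.1, 0.1) for _ in inputs]
# Deliberately oversized weights push the unit into the flat tail.
large_w = [10.0 for _ in inputs]

grad_small = sigmoid_grad(pre_activation(small_w, inputs))
grad_large = sigmoid_grad(pre_activation(large_w, inputs))
print(grad_small, grad_large)
```

With the large weights the local gradient is practically zero, so almost no learning signal passes through that unit.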
Learning Rates:
Set the learning rate too small and your model might take ages to converge; make it too
large and, within the first few training examples, your loss might shoot up sharply. Generally, a
learning rate of 0.01 is a safe bet, but this should not be taken as a strict rule, since the optimal
learning rate depends on the specific task.
Risk minimization:
If you compute the loss using the data points in your dataset, it is called the empirical risk. It is
“empirical” and not “true” because we are using a dataset that is a subset of the whole population. ...
The process of finding the function that minimizes this quantity is called empirical risk minimization.
Ideally, we would like to minimize the true risk.
The empirical risk is used as an estimate of the generalization error. The reason is that,
in most problems, we do not have access to the whole domain X of inputs, but only to our
training subset S. We want to generalize based on S, which is also called inductive learning.
The goal of a machine learning algorithm is to reduce the expected generalization error.
This quantity is known as the risk.
We emphasize here that the expectation is taken over the true underlying distribution Pdata. If we
knew the true distribution Pdata(x, y), risk minimization would be an optimization task solvable by
an optimization algorithm. However, when we do not know Pdata (x, y) but only have a training set
of samples, we have a machine learning problem.
The simplest way to convert a machine learning problem back into an optimization problem
is to minimize the expected loss on the training set. This means replacing the true distribution
p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the
empirical risk.
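In a common notation, the risk and the empirical risk can be written as follows. The symbols are assumptions made here for illustration: L is a per-example loss, f(x; θ) is the model with parameters θ, and m is the number of training examples.

```latex
J^{*}(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}} \, L\big(f(x; \theta), y\big)
\qquad
\hat{J}(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} \, L\big(f(x; \theta), y\big)
  = \frac{1}{m} \sum_{i=1}^{m} L\big(f(x^{(i)}; \theta), y^{(i)}\big)
```

The first expression averages over the true distribution, the second over the training set only.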
https://www.youtube.com/watch?v=BqzgUnrNhFM
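Bias: error from overly simple assumptions in the model; it shows up as high error on the training data.
Variance: sensitivity of the model to the particular training sample; it shows up as a large gap
between training error and test error.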
Loss function:
The function we want to minimize or maximize is called the objective function or criterion.
When we are minimizing it, we may also call it the cost function, loss function, or error function.
The loss function is one of the important components of a neural network. The loss is
the prediction error of the network, and the method used to calculate it is called the loss function.
In simple words, the loss is the quantity whose gradients are used to update the weights.
Neural networks use optimization strategies such as stochastic gradient descent to minimize
the error. This error is computed with a loss function, which quantifies how well or badly
the model is performing. Loss functions are divided into two categories:
regression losses and classification losses.
Regression Loss:
Regression Loss is used when we are predicting continuous values like the price of a house or sales
of a company.
Mean Squared Error (MSE) is the mean of the squared differences between the actual and predicted values.
Because the differences are squared, large errors are penalized heavily.
If we want to reduce the penalty on large differences, we can take the natural logarithm of the
actual and predicted values before computing the mean squared error (Mean Squared Logarithmic
Error). This overcomes the problem posed by MSE: the model is now penalized less for large
differences than with the earlier method.
Sometimes there are data points that lie far away from the rest of the points, i.e. outliers. In such
cases Mean Absolute Error (MAE) loss is appropriate, as it calculates the average of the
absolute differences between the actual and predicted values.
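The three regression losses can be compared on a small example. This is a minimal sketch in plain Python; the data values are made up, and the logarithmic variant uses log(1 + v) on both actual and predicted values, which is a common convention assumed here rather than something stated in the notes.

```python
import math

def mse(y_true, y_pred):
    # Mean of squared differences
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def msle(y_true, y_pred):
    # Mean squared difference of log(1 + value); assumes non-negative values
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # Mean of absolute differences
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions; the last pair is an outlier.
y_true = [3.0, 5.0, 2.0, 100.0]
y_pred = [2.0, 5.0, 4.0, 10.0]

print("MSE :", mse(y_true, y_pred))
print("MSLE:", msle(y_true, y_pred))
print("MAE :", mae(y_true, y_pred))
```

The single outlier dominates MSE, while MAE and the logarithmic loss react far less strongly to it.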
Binary Classification Loss Function
Suppose we are dealing with a yes/no situation such as “a person has diabetes or not”. In this kind of
scenario a binary classification loss function is used.
The model outputs a probability between 0 and 1 for the positive class. Binary cross-entropy measures
the average mismatch between the predicted probabilities and the actual labels, penalizing confident
wrong predictions most heavily.
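A minimal sketch of binary cross-entropy follows. The clipping constant eps and the example probabilities are assumptions added here to keep the logarithm finite and to give the function something to run on.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip the probability so that log(0) never occurs
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]  # hypothetical predictions
confident_wrong = [0.1, 0.9, 0.2, 0.8]  # same confidence, wrong direction

print(binary_cross_entropy(labels, confident_right))
print(binary_cross_entropy(labels, confident_wrong))
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong.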
Hinge Loss
This type of loss is used when the target variable has 1 or -1 as class labels. It penalizes the model
when the sign of the predicted value differs from the actual class label, and also when a correct
prediction falls inside the margin. It is used in SVM models.
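As a small illustration, hinge loss can be written in one line. This sketch assumes labels in {-1, +1} and raw (unsquashed) model scores; the function name is illustrative.

```python
def hinge_loss(y_true, scores):
    # max(0, 1 - y * score), averaged over the examples
    return sum(max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)) / len(y_true)

print(hinge_loss([1, -1], [2.0, -3.0]))  # correct sign, outside the margin
print(hinge_loss([1], [0.5]))            # correct sign, but inside the margin
print(hinge_loss([1], [-1.0]))           # wrong sign
```

Predictions with the correct sign and a score of at least 1 in magnitude cost nothing; everything else is penalized linearly.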
Multi-Class Classification Loss Function
If we take a dataset like Iris, where we need to predict one of three class labels (Setosa, Versicolor and
Virginica), the target variable has more than two classes, and a multi-class
classification loss function is used.
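A common multi-class loss is categorical cross-entropy applied to softmax outputs. The sketch below assumes that choice (the notes do not name a specific loss), and the logits are made-up numbers.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(true_class, logits):
    # Negative log-probability assigned to the correct class
    probs = softmax(logits)
    return -math.log(probs[true_class])

classes = ["Setosa", "Versicolor", "Virginica"]
logits = [2.0, 0.5, 0.1]  # hypothetical network outputs for one flower

for index, name in enumerate(classes):
    print(name, categorical_cross_entropy(index, logits))
```

The loss is lowest when the true class is the one the network scores highest.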
Gradient descent
Gradient descent is an optimization algorithm used when training a machine learning model.
It iteratively tweaks the model's parameters to move a given function toward a local minimum;
for a convex function that local minimum is also the global minimum.
When the derivative of the loss with respect to the weights is computed over all the data points for each update, it is (batch) gradient descent (GD).
If it is computed for 1 point per update, it is stochastic gradient descent (SGD).
If it is computed for K points (K < N), where N represents all the data points, it is mini-batch SGD.
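The three variants differ only in how many points enter each gradient computation. The sketch below illustrates this with a single function and a batch_size argument; the one-parameter model y = w * x, the learning rate, and the noiseless data are assumptions chosen to keep the example small.

```python
import random

def grad(w, batch):
    # Derivative of the mean squared error for the model y = w * x
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        # One weight update per batch
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= lr * grad(w, batch)
    return w

data = [(float(x), 2.0 * x) for x in range(1, 9)]  # N = 8 points on y = 2x

w_gd = train(data, batch_size=len(data))  # all N points: gradient descent
w_sgd = train(data, batch_size=1)         # 1 point: stochastic gradient descent
w_mb = train(data, batch_size=4)          # K = 4 < N: mini-batch SGD
print(w_gd, w_sgd, w_mb)
```

On this noiseless toy problem all three settings recover a weight close to 2; on real data they trade gradient accuracy against the cost of each update.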
Back propagation:
Back-propagation is just a way of propagating the total loss back into the neural network to
know how much of the loss every node is responsible for, and subsequently updating the weights in
such a way that minimizes the loss by giving the nodes with higher error rates lower weights and
vice versa.
Backpropagation is the essence of neural network training. It is the method of fine-tuning
the weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights allows you to reduce error rates and make the model reliable
by increasing its generalization.
The backpropagation algorithm computes the gradient of the loss function with respect to
each weight using the chain rule. It computes the gradient efficiently one layer at a time,
working backward from the output, unlike a naive direct computation for every weight separately.
It computes the gradient, but it does not define how the gradient is used.
It generalizes the computation in the delta rule.
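The chain rule can be traced by hand on a very small network. The sketch below assumes a network with one sigmoid hidden unit and one linear output, with a squared-error loss; these choices and all names are illustrative. The backpropagated gradients are compared against finite differences as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    h = sigmoid(w1 * x)   # hidden activation
    y_hat = w2 * h        # linear output
    return h, y_hat

def loss(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return 0.5 * (y_hat - y) ** 2

def backward(w1, w2, x, y):
    h, y_hat = forward(w1, w2, x)
    d_yhat = y_hat - y            # dL/dy_hat
    d_w2 = d_yhat * h             # dL/dw2
    d_h = d_yhat * w2             # dL/dh, passed back to the hidden layer
    d_z = d_h * h * (1.0 - h)     # dL/dz, where z = w1 * x
    d_w1 = d_z * x                # dL/dw1
    return d_w1, d_w2

def numeric_grad(f, v, eps=1e-6):
    # Central finite difference, used only to check the result
    return (f(v + eps) - f(v - eps)) / (2 * eps)

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0  # hypothetical values
g_w1, g_w2 = backward(w1, w2, x, y)
n_w1 = numeric_grad(lambda v: loss(v, w2, x, y), w1)
n_w2 = numeric_grad(lambda v: loss(w1, v, x, y), w2)
print(g_w1, n_w1)
print(g_w2, n_w2)
```

Each line in backward reuses the quantity computed on the line before it, which is what makes the layer-by-layer computation efficient.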
Static Back-propagation
It is one kind of backpropagation network which produces a mapping of a static input for
static output. It is useful to solve static classification issues like optical character
recognition.
Recurrent Backpropagation
In recurrent backpropagation, activations are fed forward until a fixed value is reached.
After that, the error is computed and propagated backward.
Backpropagation is especially useful for deep neural networks working on error-prone projects,
such as image or speech recognition.
Regularization:
One of the most common problems data science professionals face is overfitting.
Have you come across a situation where your model performed exceptionally well on the training data but
was not able to predict the test data?
Avoiding overfitting can single-handedly improve a model’s performance. Here we will
look at the concept of overfitting and how regularization helps in overcoming
it.
Regularization:
Regularization is a technique used in machine learning and deep learning to prevent
overfitting and improve the generalization performance of a model. It involves adding a penalty
term to the loss function during training.
This penalty discourages the model from becoming too complex or having large parameter
values, which helps in controlling the model’s ability to fit noise in the training data. Regularization
methods include L1 and L2 regularization, dropout, early stopping, and more. By applying
regularization, models become more robust and better at making accurate predictions on unseen
data.
Before we deep dive into the topic, take a look at this image:
Have you seen this image before? As we move towards the right in this image, our model tries to
learn too well the details and the noise from the training data, which ultimately results in poor
performance on the unseen data.
In other words, while going towards the right, the complexity of the model increases such that the
training error reduces but the testing error doesn’t. This is shown in the image below.
If you’ve built a neural network before, you know how complex they are. This makes them more
prone to overfitting.
How does Regularization help reduce Overfitting?
Let’s consider a neural network which is overfitting on the training data as shown in the image
below.
If you have studied the concept of regularization in machine learning, you will have a fair idea
that regularization penalizes the coefficients. In deep learning, it actually penalizes the weight
matrices of the nodes.
Assume that our regularization coefficient is so high that some of the weight matrices are nearly
equal to zero.
This will result in a much simpler linear network and slight underfitting of the training data.
Such a large value of the regularization coefficient is not that useful. We need to optimize the value
of regularization coefficient in order to obtain a well-fitted model as shown in the image below.
Different Regularization Techniques in Deep Learning
Now that we have an understanding of how regularization helps in reducing overfitting, we’ll learn
a few different techniques in order to apply regularization in deep learning.
L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general cost function by
adding another term known as the regularization term.
Cost function = Loss (say, binary cross entropy) + Regularization term
Due to the addition of this regularization term, the values of weight matrices decrease because it
assumes that a neural network with smaller weight matrices leads to simpler models. Therefore, it
will also reduce overfitting to quite an extent.
In L2, the regularization term is the sum of the squared weights, scaled by lambda.
Here, lambda is the regularization parameter. It is a hyperparameter whose value is tuned for
better results. L2 regularization is also known as weight decay, as it forces the weights to decay
towards zero (but not exactly zero).
In L1, the regularization term is the sum of the absolute values of the weights, scaled by lambda.
Unlike L2, the weights may be reduced exactly to
zero here. Hence, it is very useful when we are trying to compress our model. Otherwise, we
usually prefer L2 over it.
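Written out, the two regularized cost functions commonly take the following form. The scaling factor 1/(2m), with m the number of training examples, is a convention assumed here; other texts use λ alone or λ/2.

```latex
\text{L2:}\quad J(w) = \text{Loss} + \frac{\lambda}{2m} \sum_{i} w_i^{2}
\qquad
\text{L1:}\quad J(w) = \text{Loss} + \frac{\lambda}{2m} \sum_{i} |w_i|
```

The gradient of the L2 term is proportional to each weight, so large weights shrink faster, while the gradient of the L1 term has constant magnitude, which is what can push small weights exactly to zero.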
Dropout
This is one of the most interesting regularization techniques. It also produces very good
results and is consequently one of the most frequently used regularization techniques in deep
learning.
To understand dropout, let’s say our neural network structure is akin to the one shown below:
So what does dropout do? At every iteration, it randomly selects some nodes and removes them
along with all of their incoming and outgoing connections as shown below.
So each iteration has a different set of nodes and this results in a different set of outputs. It can also
be thought of as an ensemble technique in machine learning.
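Dropout on one layer's activations can be sketched as a random mask. The version below is the "inverted dropout" variant, which rescales the surviving activations so their expected value is unchanged; that rescaling, the drop probability, and the names are assumptions added for illustration.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    # At test time (or with drop_prob 0) the layer is left untouched
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)  # keep the node and rescale it
        else:
            out.append(0.0)            # drop the node for this iteration
    return out

rng = random.Random(0)
hidden = [1.0] * 1000  # hypothetical hidden-layer activations
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
print(dropped.count(0.0), "of", len(dropped), "nodes dropped in this iteration")
```

Calling the function again draws a new mask, which is why every iteration effectively trains a different sub-network.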
Data Augmentation:
The simplest way to reduce overfitting is to increase the size of the training data. In machine
learning, we were not able to increase the size of training data as the labeled data was too costly.
But, now let’s consider we are dealing with images. In this case, there are a few ways of increasing
the size of the training data – rotating the image, flipping, scaling, shifting, etc. In the below image,
some transformation has been done on the handwritten digits dataset.
This technique is known as data augmentation. This usually provides a big leap in improving the
accuracy of the model. It can be considered as a mandatory trick in order to improve our
predictions.
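Two of the transformations mentioned above, flipping and shifting, can be sketched on a tiny image stored as a list of rows. The 3x3 image and the function names are illustrative assumptions; real pipelines would normally use an image library.

```python
def flip_horizontal(image):
    # Mirror every row
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    # Move the content to the right and pad the left edge with a fill value
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + row[:width - pixels] for row in image]

# Hypothetical 3x3 binary image
image = [
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]

augmented = [image, flip_horizontal(image), shift_right(image, 1)]
for variant in augmented:
    print(variant)
```

Each variant keeps the original label, so one labeled example becomes three training examples.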
Early stopping
Early stopping is a kind of cross-validation strategy where we keep one part of the training set as
the validation set. When we see that the performance on the validation set is getting worse, we
immediately stop the training on the model. This is known as early stopping.
In the above image, we will stop training at the dotted line since after that our model will start
overfitting on the training data.
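The stopping rule can be sketched as a scan over the validation losses recorded after each epoch. The patience parameter (how many non-improving epochs to tolerate) is an assumption added here; with patience=1 the rule stops at the first epoch where validation performance gets worse, as described above. The loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch and reset the counter
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]  # hypothetical values
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print("stop at epoch", stop_epoch, "and keep the weights from epoch", best_epoch)
```

In practice the weights saved at the best epoch are the ones kept as the final model.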
Model selection:
Model selection is the process of choosing one final machine learning model, the one that
best addresses the problem, from among a collection of candidate models trained on a
dataset.
Model selection is a process that can be applied both across different types of models (e.g.
logistic regression, SVM, KNN, etc.) and across models of the same type configured with different
model hyper parameters (e.g. different kernels in an SVM).
All models have some predictive error, given the statistical noise in the data, the
incompleteness of the data sample, and the limitations of each different model type. Therefore, the
notion of a perfect or best model is not useful. Instead, we must seek a model that is “good enough.”
Therefore, a “good enough” model may refer to many things and is specific to your project, such as:
Some algorithms require specialized data preparation in order to best expose the structure of the
problem to the learning algorithm. Therefore, we must go one step further and consider model
selection as the process of selecting among model development pipelines.
Each pipeline may take in the same raw training dataset and output a model that can be
evaluated in the same manner, but may require different or overlapping computational steps, such
as:
Data filtering.
Data transformation.
Feature selection.
Feature engineering.
And more…
The best approach to model selection requires “sufficient” data, which may be nearly infinite
depending on the complexity of the problem.
In this ideal situation, we would split the data into training, validation, and test sets, then fit
candidate models on the training set, evaluate and select them on the validation set, and report the
performance of the final model on the test set.
This is impractical on most predictive modeling problems given that we rarely have sufficient data,
or are able to even judge what would be sufficient.
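The ideal train/validation/test procedure can be sketched as follows. The example assumes k-nearest-neighbour regression with different values of k as the candidate models and uses synthetic data; both are illustrative choices, not taken from the notes.

```python
import random

def knn_predict(train, x, k):
    # Average the targets of the k training points closest to x
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
# Synthetic data: y = x^2 plus noise
points = [(x / 10.0, (x / 10.0) ** 2 + rng.gauss(0.0, 0.5)) for x in range(100)]
rng.shuffle(points)
train, val, test = points[:60], points[60:80], points[80:]

candidates = [1, 3, 5, 9, 15]                                # candidate hyperparameters
val_errors = {k: mse(train, val, k) for k in candidates}     # evaluate on validation set
best_k = min(val_errors, key=val_errors.get)                 # select the final model
test_error = mse(train, test, best_k)                        # report on the test set once

print("selected k:", best_k)
print("test MSE  :", test_error)
```

The test set is touched only once, after the choice has been made, so the reported error is not biased by the selection step.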
There are two main classes of techniques to approximate the ideal case of model selection; they are:
Optimization:
The process of minimizing (or maximizing) any mathematical expression is called
optimization. Optimizers are algorithms or methods used to change the attributes of the neural
network such as weights and learning rate to reduce the losses. Optimizers are used to solve
optimization problems by minimizing the function.
It is impossible to know what your model’s weights should be right from the start.
But with some trial and error based on the loss function (like a hiker checking whether each step
leads downhill), you can end up getting there eventually.
How you should change the weights or learning rate of your neural network to reduce the
losses is defined by the optimizer you use. Optimization algorithms are responsible for reducing
the losses and providing the most accurate results possible.
We’ll learn about different types of optimizers and how they exactly work to minimize the loss
function.
1) Gradient Descent – Covered above
2) Stochastic Gradient Descent (SGD) - Covered above
3) Mini Batch Stochastic Gradient Descent (MB-SGD) - Covered above
4) SGD with momentum
SGD with momentum overcomes the noisy updates of plain SGD by denoising the gradients. Weight
updates depend on noisy derivatives, and if we denoise the derivatives (for example, by
averaging them over recent steps), the time to converge decreases.
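One common way to write the momentum update keeps an exponentially weighted average of past gradients. The sketch below assumes that formulation, with an illustrative decay factor beta and a simple quadratic objective; other texts write the update without the (1 - beta) factor.

```python
def sgd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=500):
    w, velocity = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        # Exponentially weighted average of the gradients seen so far
        velocity = beta * velocity + (1.0 - beta) * g
        # Step along the smoothed gradient instead of the raw one
        w -= lr * velocity
    return w

# Hypothetical objective f(w) = (w - 3)^2 with gradient 2 * (w - 3)
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

Because each update uses an average over recent gradients, a single noisy gradient has only a limited effect on the direction of the step.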
Conditional Random Fields: Linear chain, partition function, Markov network,
Belief propagation, Training CRFs, Hidden Markov Model, Entropy.
Conditional random fields (CRFs) are a class of statistical modeling methods often applied
in pattern recognition and machine learning and used for structured prediction. Whereas a classifier
predicts a label for a single sample without considering "neighboring" samples, a CRF can take
context into account.
1. model training, learning the conditional distributions between the Yi and feature functions
from some corpus of training data.
Linear chain conditional random fields (LC-CRFs) are a type of graphical model used
in machine learning to model sequential data. They are a variant of conditional random fields
(CRFs), which are a type of discriminative probabilistic model used for structured prediction.
LC-CRFs are used where the output is structured as a sequence, in areas such
as natural language processing (NLP), speech recognition, and computer vision.
LC-CRFs are similar to hidden Markov models (HMMs), which are another type of graphical
model used for sequential data. However, LC-CRFs are more flexible than HMMs because they
allow for more complex features to be used in the model. In addition, LC-CRFs can be trained
using maximum likelihood estimation (MLE) or maximum a posteriori (MAP) estimation.
LC-CRFs are used in many applications, including part-of-speech tagging, named entity
recognition, chunking, and segmentation. They have also been used in deep learning models,
such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), to
model sequential data.
partition function:
In the context of deep learning and conditional random fields (CRFs), the partition function
plays a crucial role in modeling and training sequential or structured prediction tasks. CRFs are a
type of probabilistic graphical model used for tasks like named entity recognition, part-of-speech
tagging, and other sequence labeling tasks.
The partition function in a CRF is also known as the normalization constant. It is a sum
(or an integral in the continuous case) over all possible labelings or
assignments to the variables in the CRF, and it ensures that the probabilities assigned by the CRF
model to all possible labelings sum up to 1.
Mathematically, for a sequence of random variables (nodes) X = {X_1, X_2, ..., X_n} and a set of
potential functions φ, the partition function Z is defined as the sum, over all joint assignments x,
of the product of the potential functions over the cliques c of the graph:
$$Z = \sum_{x} \prod_{c} \phi_c(x_c)$$
The partition function Z is used during both training and inference. During training, it is
used to compute the likelihood of the observed data given the model parameters. Inference, which is
the process of predicting the most likely label sequence for a given input, often involves computing
the conditional probabilities of labels given the input features, and the partition function is used to
normalize these probabilities.
Overall, the partition function is a fundamental component of CRFs in deep learning,
ensuring that the model assigns valid probabilities to different label sequences and allowing for
effective training and inference in structured prediction tasks.
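For a linear chain, the partition function does not require enumerating every labeling; it can be computed with the forward algorithm. The sketch below assumes log-space scores: unary[t][y] for label y at position t and pairwise[a][b] for moving from label a to label b. These names and the numeric values are hypothetical. The result is checked against brute-force enumeration.

```python
import itertools
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(labels, unary, pairwise):
    # Unnormalized log-score of one complete labeling
    score = sum(unary[t][y] for t, y in enumerate(labels))
    score += sum(pairwise[a][b] for a, b in zip(labels, labels[1:]))
    return score

def log_partition(unary, pairwise):
    # Forward algorithm: alpha[j] sums over all prefixes that end in label j
    n_labels = len(unary[0])
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [
            unary[t][j] + logsumexp([alpha[i] + pairwise[i][j] for i in range(n_labels)])
            for j in range(n_labels)
        ]
    return logsumexp(alpha)

def log_partition_brute_force(unary, pairwise):
    n_labels = len(unary[0])
    all_labelings = itertools.product(range(n_labels), repeat=len(unary))
    return logsumexp([sequence_score(y, unary, pairwise) for y in all_labelings])

unary = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]  # 3 positions, 2 labels
pairwise = [[0.6, -0.4], [-0.2, 0.9]]

log_z = log_partition(unary, pairwise)
print(log_z, log_partition_brute_force(unary, pairwise))
```

Dividing exp(score) of any labeling by exp(log_z) gives its probability, and these probabilities sum to 1 over all labelings.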
Markov network:
A Markov network, also known as a Markov random field or undirected graphical model, is a type of
probabilistic graphical model used in machine learning and deep learning for modeling complex, high-
dimensional probability distributions. Markov networks are particularly useful for capturing dependencies
between variables in a probabilistic way.
In the context of deep learning and conditional random fields (CRFs), Markov networks are often
used as a component of models for tasks like structured prediction, sequence labeling, and other tasks that
involve modeling dependencies between variables. Here's how Markov networks relate to conditional
random fields in deep learning:
Markov networks and CRFs are just one aspect of deep learning for structured data. They are useful for
modeling tasks where the output variables exhibit structured dependencies, and they can be integrated with
deep neural networks to leverage their representation learning capabilities. These models are often used in
tasks like semantic segmentation, named entity recognition, and sequence labeling, among others.
Belief propagation:
Belief propagation (BP) is a message-passing algorithm commonly used in
probabilistic graphical models, such as Bayesian networks and Markov random fields, to perform inference
and make predictions. In the context of deep learning, BP can also be applied to Conditional Random Fields
(CRFs) or Conditional Network Fields (CNFs), which are used for tasks like structured prediction, sequence
labeling, and image segmentation. Here's a brief overview of belief propagation in the context of CNFs in
deep learning:
CNFs are a type of probabilistic graphical model that extends the concept of Conditional Random
Fields (CRFs). In CNFs, you model dependencies between random variables in a structured way, while
taking into account the conditional dependencies given the observed data. This makes CNFs well-suited for
tasks where the output is not just a collection of independent labels but has structured relationships.
4) Learning in CNFs:
In the context of deep learning, CNFs can be used as part of a larger model, and they can be learned
from data along with other neural network components.
Learning in CNFs can involve training the model's parameters to maximize the likelihood of the
observed data, often through methods like maximum likelihood estimation or structured loss
functions.
Belief propagation in Conditional Network Fields is a powerful technique for structured prediction tasks
that involve modeling dependencies between variables. It leverages probabilistic graphical modeling and
message-passing algorithms to make predictions while considering the conditional dependencies in the data.
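On a chain-structured model, belief propagation reduces to one pass of messages from left to right and one from right to left. The sketch below assumes non-negative node potentials node_pot[t][s] and a shared edge potential edge_pot[a][b]; the names and numbers are hypothetical. The resulting marginals are compared with brute-force enumeration.

```python
import itertools

def chain_marginals(node_pot, edge_pot):
    n, k = len(node_pot), len(node_pot[0])
    # fwd[t][j]: message arriving at node t from its left neighbour
    fwd = [[1.0] * k for _ in range(n)]
    for t in range(1, n):
        for j in range(k):
            fwd[t][j] = sum(fwd[t - 1][i] * node_pot[t - 1][i] * edge_pot[i][j]
                            for i in range(k))
    # bwd[t][i]: message arriving at node t from its right neighbour
    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for i in range(k):
            bwd[t][i] = sum(edge_pot[i][j] * node_pot[t + 1][j] * bwd[t + 1][j]
                            for j in range(k))
    # Belief at each node: local potential times both incoming messages
    marginals = []
    for t in range(n):
        belief = [fwd[t][s] * node_pot[t][s] * bwd[t][s] for s in range(k)]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals

def brute_force_marginals(node_pot, edge_pot):
    n, k = len(node_pot), len(node_pot[0])
    totals = [[0.0] * k for _ in range(n)]
    for labels in itertools.product(range(k), repeat=n):
        weight = 1.0
        for t, s in enumerate(labels):
            weight *= node_pot[t][s]
        for a, b in zip(labels, labels[1:]):
            weight *= edge_pot[a][b]
        for t, s in enumerate(labels):
            totals[t][s] += weight
    return [[v / sum(row) for v in row] for row in totals]

node_pot = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0]]
edge_pot = [[2.0, 1.0], [1.0, 3.0]]
bp = chain_marginals(node_pot, edge_pot)
exact = brute_force_marginals(node_pot, edge_pot)
print(bp)
```

On chains and trees the messages give exact marginals; on graphs with loops the same updates are only an approximation.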
Training CRFs:
Training Conditional Random Fields (CRFs) in the context of deep learning involves combining the
advantages of deep neural networks with the structured prediction capabilities of CRFs. CRFs are often used
for tasks like sequence labeling, image segmentation, and other structured output prediction tasks. Here's an
overview of how to train CRFs in deep learning:
1) Data Preparation:
First, you need to prepare your data for training. This typically involves creating labeled datasets
where each input has a corresponding structured output. For example, in the case of part-of-
speech tagging, you would have sentences with labeled part-of-speech tags for each word.
2) Feature Extraction:
For each input, you need to extract features that can be used by your CRF model. These features
can be based on the input data and can be designed to capture relevant information for the
structured prediction task. In the context of deep learning, you might also consider using neural
network embeddings as features.
3) Design the CRF Model:
The CRF model is designed to capture dependencies between the structured output variables.
You can define a CRF in terms of the following components:
Unary Potentials: These represent the compatibility between each output label and the
input features. In deep learning, you can use neural networks to compute these unary
potentials.
Pairwise Potentials: These represent the dependencies between adjacent output labels.
They can also be modeled using neural networks or other learned functions.
The CRF model can be represented as a graphical model, where nodes represent the output
variables and edges represent pairwise dependencies.
4) Objective Function:
Define an objective function that measures the difference between the predicted structured output
and the ground truth. Common objective functions include the negative log-likelihood or structured
loss functions like the structured perceptron loss or structured hinge loss.
5) Training:
Train the CRF model using your labeled dataset and the defined objective function. This can be done
using techniques like stochastic gradient descent (SGD) or other optimization methods.
Backpropagation is used to compute gradients for the neural network components of the CRF. The
gradients are then used to update the model's parameters.
6) Regularization:
To prevent overfitting, you can apply regularization techniques such as L1 or L2 regularization to
the neural network components of the CRF.
7) Inference:
During inference, you use the trained CRF model to make predictions on new, unlabeled data. This
typically involves finding the most likely structured output given the input features. This can be done
using algorithms like Viterbi decoding or beam search.
8) Evaluation:
Evaluate the performance of your CRF model using appropriate metrics for your task, such as
accuracy, F1-score, or intersection-over-union (IoU) for segmentation tasks.
Training CRFs in deep learning allows you to take advantage of the expressive power of neural networks
for feature extraction while benefiting from the structured prediction capabilities of CRFs. This hybrid
approach is particularly useful for tasks where modeling dependencies between output variables is crucial for
accurate predictions.
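Steps 4 and 7 above can be sketched for a linear-chain CRF as follows. The sketch assumes the unary and pairwise potentials are already available as log-space score tables (in a deep model they would come from a neural network); the table values and function names are hypothetical. The objective is the negative log-likelihood, and inference uses Viterbi decoding.

```python
import itertools
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def score(labels, unary, pairwise):
    s = sum(unary[t][y] for t, y in enumerate(labels))
    return s + sum(pairwise[a][b] for a, b in zip(labels, labels[1:]))

def negative_log_likelihood(gold, unary, pairwise):
    # NLL = log Z - score(gold); log Z comes from the forward recursion
    k = len(unary[0])
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [unary[t][j] + logsumexp([alpha[i] + pairwise[i][j] for i in range(k)])
                 for j in range(k)]
    return logsumexp(alpha) - score(gold, unary, pairwise)

def viterbi(unary, pairwise):
    # Same recursion as the forward pass, with max in place of sum
    k = len(unary[0])
    best = list(unary[0])
    backptr = []
    for t in range(1, len(unary)):
        step, ptr = [], []
        for j in range(k):
            i_best = max(range(k), key=lambda i: best[i] + pairwise[i][j])
            ptr.append(i_best)
            step.append(best[i_best] + pairwise[i_best][j] + unary[t][j])
        best = step
        backptr.append(ptr)
    # Follow the back-pointers from the best final label
    path = [max(range(k), key=lambda j: best[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]

unary = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]  # 3 positions, 2 labels
pairwise = [[0.6, -0.4], [-0.2, 0.9]]

decoded = viterbi(unary, pairwise)
print("decoded labels:", decoded)
print("NLL of decoded labels:", negative_log_likelihood(decoded, unary, pairwise))
```

During training, the gradient of this loss with respect to the score tables would be passed back into the network that produced them.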
Hidden Markov Model:
Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) belong to the
family of graphical models in machine learning.
The word “Hidden” symbolizes the fact that only the symbols released by the system are observable,
while the user cannot view the underlying random walk between states. Many in this field recognize HMM
as a finite state machine.
Let’s imagine that we have an English sentence like “Math is the language of nature.” We
want to label each word with its part of speech. That way, we get a graph:
Let’s see what each color and arrow represent. The sentence is our observed data (shown in
gray circles) and the labels are the hidden information we want to extract (shown in white
circles). In addition, each word depends on its label and each label depends on the previous
label.
This graph represents a first-order hidden Markov model which belongs to the family
of Bayesian networks. The reason why we call it “first-order” is that each hidden variable
depends only on the previous one. We can extend it to a higher-order model by conditioning the
hidden variables on more than one predecessor.
The general first-order HMM graph for any sequence is similar to that of our example:
We can assign probabilities to each arrow and model the occurrence of such a sequence using
the product of these probabilities:
$$P(X, Y) = \prod_{i=1}^{n} P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)$$
where $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ are the sequence and the array of its labels,
and $P(y_1 \mid y_0)$ stands for the probability of the first label. The $P(x_i \mid y_i)$
are emission probabilities and the $P(y_i \mid y_{i-1})$ are transition probabilities. The former is the
set of probabilities for each observation given its corresponding state, while the latter is the
probability of a state given its preceding state. Since an HMM models the joint probability
of $X$ and $Y$, it’s a generative model.
Example:
Let’s imagine we want to tag the words in the sentence “I love programming”. From our
training data, we can learn the emission and transition probabilities by counting. For instance,
let’s imagine that the training data consists of three sentences:
To find the optimal labels, we can go over all the combinations of states and observations to
find the sequence that has the maximum probability. For our example, the best sequence is:
Given that our example was a small one with only a couple of words and parts of speech, it’s
easy to enumerate all the possible combinations. However, this can become extremely difficult
to do in more complex and longer sentences. For them, we use inference algorithms such
as Viterbi or Forward-backward.
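Learning by counting and the exhaustive search over tag sequences can be sketched as follows. The three tagged sentences below are a made-up corpus chosen for illustration, not the training data of the original example, and the start symbol and all names are assumptions.

```python
import itertools
from collections import Counter

# Hypothetical tagged training corpus
corpus = [
    [("I", "PRON"), ("love", "VERB"), ("math", "NOUN")],
    [("I", "PRON"), ("enjoy", "VERB"), ("programming", "NOUN")],
    [("programming", "NOUN"), ("helps", "VERB")],
]
START = "<s>"  # artificial start state

transitions = Counter()  # (previous tag, tag) counts
outgoing = Counter()     # how often each tag is followed by another tag
emissions = Counter()    # (tag, word) counts
tag_counts = Counter()   # how often each tag occurs

for sentence in corpus:
    prev = START
    for word, tag in sentence:
        transitions[(prev, tag)] += 1
        outgoing[prev] += 1
        emissions[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag

def joint_probability(words, tags):
    # Product of transition and emission probabilities along the sequence
    prob, prev = 1.0, START
    for word, tag in zip(words, tags):
        prob *= transitions[(prev, tag)] / outgoing[prev]
        prob *= emissions[(tag, word)] / tag_counts[tag]
        prev = tag
    return prob

words = ["I", "love", "programming"]
tagset = sorted(tag_counts)

# Enumerate every tag sequence and keep the most probable one
best_tags = max(itertools.product(tagset, repeat=len(words)),
                key=lambda tags: joint_probability(words, tags))
best_prob = joint_probability(words, best_tags)
print(best_tags, best_prob)
```

The enumeration grows exponentially with sentence length, which is why Viterbi is used instead for anything beyond toy examples.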
Advantages of HMM
HMM has a strong statistical foundation with efficient learning algorithms where learning can take
place directly from raw sequence data. It allows consistent treatment of insertion and deletion penalties in the
form of locally learnable methods and can handle inputs of variable length. They are the most flexible
generalization of sequence profiles. It can also perform a wide variety of operations including multiple
alignment, data mining and classification, structural analysis, and pattern discovery. It is also easy to
combine into libraries.
Disadvantages of HMM
HMM is only dependent on every state and its corresponding observed object:
The sequence labeling, in addition to having a relationship with individual words, also relates to such
aspects as the observed sequence length, word context and others.
The target function and the predicted target function do not match:
HMM acquires the joint distribution P(Y, X) of the state and the observed sequence, while in the
estimation issue, we need a conditional probability P(Y|X).
Entropy:
Entropy in conditional random fields (CRFs) is often used as a measure of uncertainty or disorder in
the context of deep learning and natural language processing. CRFs are a type of probabilistic graphical
model commonly used for sequence labeling tasks such as part-of-speech tagging, named entity recognition,
and more.
1) Model Training: During training, the entropy of a CRF's predictions can be used as a regularization
term. By encouraging the model to have lower entropy in its predictions, it can be guided towards
making more confident and accurate predictions.
2) Inference: When making predictions on new data, the entropy of the model's output can be used to
assess the uncertainty of the predictions. High entropy implies uncertainty, while low entropy
implies confidence.
3) Evaluation: Entropy can be used as one signal of the quality of a CRF model. Lower entropy on the test
data indicates more confident predictions, which reflects better performance only when those
predictions are also correct.
In the context of deep learning, CRFs are often used as a structured output layer in neural networks, and
the entropy is computed over the conditional distribution of labels given the input features. This can be
helpful in various applications, especially when dealing with tasks that require handling complex
dependencies and structured output.
Overall, entropy in CRFs serves as a useful tool for managing uncertainty and enhancing the
performance of deep learning models, particularly in sequence labeling tasks.
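The entropy of a predicted label distribution can be computed directly from its probabilities. The sketch below assumes the distribution is the model's marginal over labels at one position; the two example distributions are made up.

```python
import math

def entropy(probs):
    # H(p) = -sum p * log p, skipping zero-probability entries
    return -sum(p * math.log(p) for p in probs if p > 0.0)

confident = [0.97, 0.01, 0.01, 0.01]  # hypothetical confident prediction
uncertain = [0.25, 0.25, 0.25, 0.25]  # hypothetical maximally uncertain prediction

print("confident:", entropy(confident))
print("uncertain:", entropy(uncertain))
```

The uniform distribution reaches the maximum possible value, log of the number of labels, while a distribution concentrated on one label approaches zero.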