1. Deep Neural Network

A deep neural network (DNN) is an ANN with multiple hidden layers between the input and
output layers. Similar to shallow ANNs, DNNs can model complex non-linear relationships.
The main purpose of a neural network is to receive a set of inputs, perform progressively
complex calculations on them, and produce output that solves real-world problems like classification.
Here we restrict ourselves to feedforward neural networks.
We have an input, an output, and a flow of sequential data in a deep network.

Neural networks are widely used in supervised learning and reinforcement learning problems.
These networks are based on a set of layers connected to each other.
In deep learning, the number of hidden layers, mostly non-linear, can be large; say about 1000
layers.
On complex tasks, DL models generally produce much better results than shallow ML models.
We mostly use the gradient descent method for optimizing the network and minimising the loss
function.
We can use ImageNet, a repository of millions of labelled digital images, to train a network to
classify images into categories like cats and dogs. DL nets are increasingly used for dynamic images
apart from static ones, and for time series and text analysis.
Training on large data sets forms an important part of deep learning models, and
backpropagation is the main algorithm used to train them.
DL deals with training large neural networks with complex input-output transformations.
One example of DL is the mapping of a photo to the name of the person(s) in the photo, as is done
on social networks; describing a picture with a phrase is another recent application of DL.
Neural networks are functions that have inputs like x1, x2, x3, ... that are transformed to outputs
like z1, z2, z3, and so on in two (shallow networks) or several (deep networks) intermediate operations,
also called layers.
The weights and biases change from layer to layer; 'w' and 'v' denote the weights, or synapses, of the
layers of the neural network.
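As a rough illustration of this flow (the shapes, values, and the ReLU activation below are arbitrary choices for the sketch, not anything prescribed above), a forward pass through one hidden layer with weights w and an output layer with weights v might look like this in Python:

import numpy as np

def relu(a):
    return np.maximum(0, a)

x = np.array([0.5, -1.2, 3.0])        # inputs x1, x2, x3

w = np.random.randn(3, 4) * 0.1       # weights of the hidden layer
b_hidden = np.zeros(4)                # biases of the hidden layer
v = np.random.randn(4, 2) * 0.1       # weights of the output layer
b_out = np.zeros(2)                   # biases of the output layer

h = relu(x @ w + b_hidden)            # hidden-layer activations
z = h @ v + b_out                     # outputs z1, z2
print(z)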
The best use case of deep learning is the supervised learning problem. Here, we have a large set of
data inputs with a desired set of outputs, and we apply the backpropagation algorithm to obtain
correct output predictions.


The most basic dataset of deep learning is MNIST, a dataset of handwritten digits.
We can train a deep Convolutional Neural Network with Keras to classify images of handwritten
digits from this dataset.
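A minimal sketch of such a model is shown below; the layer sizes, optimizer, and number of epochs are illustrative choices rather than a recommended configuration:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# load and rescale the MNIST images to [0, 1], adding a channel dimension
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., np.newaxis].astype("float32") / 255.0
x_test = x_test[..., np.newaxis].astype("float32") / 255.0

model = keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),   # one score per digit class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))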
The firing or activation of a neural net classifier produces a score. For example, to classify
patients as sick or healthy, we consider parameters such as height, weight, body temperature,
blood pressure, etc.
A high score means the patient is sick and a low score means the patient is healthy.
Each node in output and hidden layers has its own classifiers. The input layer takes inputs and
passes on its scores to the next hidden layer for further activation and this goes on till the output
is reached.
This progress from input to output from left to right in the forward direction is called forward
propagation.
Credit assignment path (CAP) in a neural network is the series of transformations starting from
the input to the output. CAPs elaborate probable causal connections between the input and the
output.
For a given feedforward neural network, the CAP depth is the number of hidden layers plus one,
as the output layer is included. For recurrent neural networks, where a signal may propagate
through a layer several times, the CAP depth is potentially unlimited.
Deep Nets and Shallow Nets
There is no clear threshold of depth that divides shallow learning from deep learning; but it is
mostly agreed that for deep learning which has multiple non-linear layers, CAP must be greater
than two.
The basic node in a neural net is a perceptron, mimicking a neuron in a biological neural network.
Then we have the multi-layer perceptron, or MLP. Each set of inputs is modified by a set of
weights and biases; each edge has a unique weight and each node has a unique bias.
The prediction accuracy of a neural net depends on its weights and biases.
The process of improving the accuracy of a neural network is called training. The output from a
forward-prop net is compared to the value that is known to be correct.
The cost function, or loss function, measures the difference between the generated output and the
actual output.
The point of training is to make the cost of training as small as possible across millions of
training examples.To do this, the network tweaks the weights and biases until the prediction
matches the correct output.
Once trained well, a neural net has the potential to make an accurate prediction every time.
When patterns get complex and you want your computer to recognise them, you have to go
for neural networks. In such complex pattern scenarios, a neural network outperforms all other
competing algorithms.
There are now GPUs that can train them faster than ever before. Deep neural networks are
already revolutionizing the field of AI.
Computers have proved to be good at performing repetitive calculations and following detailed
instructions, but have not been so good at recognising complex patterns.
For the recognition of simple patterns, a support vector machine (SVM) or a logistic regression
classifier can do the job well, but as the complexity of the pattern increases, there is no way but
to go for deep neural networks.
Therefore, for complex patterns like a human face, shallow neural networks fail, and there is no
alternative but to go for deep neural networks with more layers. Deep nets are able to do their job
by breaking down complex patterns into simpler ones. For a human face, for example, a deep net
would use edges to detect parts like lips, nose, eyes and ears, and then re-combine these to form a
human face.
Prediction accuracy has become so high that, at a recent Google Pattern Recognition Challenge,
a deep net beat a human.
This idea of a web of layered perceptrons has been around for some time; in this area, deep nets
mimic the human brain. One downside is that they take a long time to train, a hardware
constraint.
However, recent high-performance GPUs have been able to train such deep nets in under a week,
while fast CPUs could have taken weeks or perhaps months to do the same.
Choosing a Deep Net
How do we choose a deep net? We have to decide whether we are building a classifier or trying to
find patterns in the data, and whether we are going to use unsupervised learning. To extract patterns
from a set of unlabelled data, we use a Restricted Boltzmann Machine or an autoencoder.
Consider the following points while choosing a deep net −
 For text processing, sentiment analysis, parsing and named entity recognition, we use a
recurrent net or a recursive neural tensor network (RNTN);
 For any language model that operates at the character level, we use a recurrent net.
 For image recognition, we use a deep belief network (DBN) or a convolutional network.
 For object recognition, we use an RNTN or a convolutional network.
 For speech recognition, we use a recurrent net.
In general, deep belief networks and multilayer perceptrons with rectified linear units (ReLU)
are both good choices for classification.
For time series analysis, it is always recommended to use a recurrent net.
Neural nets have been around for more than 50 years, but only now have they risen to
prominence. The reason is that they are hard to train; when we try to train them with a method
called backpropagation, we run into a problem called vanishing or exploding gradients. When
that happens, training takes a longer time and accuracy takes a back seat. When training on a data
set, we are constantly calculating the cost function, which is the difference between the predicted
output and the actual output from a set of labelled training data. The cost function is then
minimized by adjusting the weight and bias values until the lowest value is obtained. The
training process uses a gradient, which is the rate at which the cost will change with respect to a
change in weight or bias values.
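As a toy illustration of this gradient-based update (a single weight and bias fitted to made-up data, with the gradients written out by hand rather than computed by backpropagation):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])    # toy inputs
y = np.array([1.0, 3.0, 5.0, 7.0])    # toy targets (y = 2x + 1)

w, b = 0.0, 0.0                       # initial weight and bias
lr = 0.05                             # learning rate (step size)

for step in range(500):
    pred = w * x + b                  # forward pass
    error = pred - y
    cost = np.mean(error ** 2)        # cost (loss) function
    dw = 2 * np.mean(error * x)       # d(cost)/d(w)
    db = 2 * np.mean(error)           # d(cost)/d(b)
    w -= lr * dw                      # move against the gradient
    b -= lr * db

print(w, b, cost)                     # w approaches 2, b approaches 1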

2. Life cycle of Deep Learning Model

Deep learning (DL) is a subset of machine learning (ML). A deep learning model is capable of
learning by creating its own computing method. While it is easy to confuse machine learning
with deep learning, there is a difference between the two. Basic ML models become better as
you continue feeding them data, but they still require human guidance because their learning
algorithms are not that deep. With deep learning, however, the model determines the accuracy or
inaccuracy of its prediction based on its own neural network, optimizers, and loss functions.
Also, DL models are deep and use huge datasets, which increases the training time drastically.

The Deep Learning Life cycle

The DL life cycle is the cyclic process that is followed to build an efficient deep learning
project. The main purpose of a DL project is to find a solution to a problem using the available
data.

There are multiple steps in the DL end-to-end life cycle that follow a specific order:

 DL Problem Statement

The first step begins by acknowledging that a problem exists and finding potential solutions that will
tangibly improve operations, reduce cost and time, and increase customer satisfaction. For
example, in the construction industry, we now have PPE kit detection, people-safety monitoring,
site development time and maintenance, etc.

 DL Data Procurement

The second step is to collect the required data and prepare it to be used in deep learning. This
will involve consulting professionals in the construction industry to determine what data would be
relevant for detecting the PPE kit and where the cameras should be placed to cover the wide area
effectively and to detect and predict unusual mishaps accurately. This stage covers data gathering
for training or for performing transfer learning, and then putting the data into a file format that
can be easily organized and understood by the model.

 DL Data Pre-Processing

Data preprocessing is a data mining technique that entails converting raw data into a format that
can be understood by DL models. Real-world data is often incomplete, unreliable, and/or
deficient in specific behaviors or patterns; it contains numerous errors and may be biased for a
particular problem statement. Preprocessing data is a tried and true way of addressing such
problems. Data pre-processing methods such as data labeling and annotation (for video and
image data) and data tagging (for speech and text data) are very important steps in the DL
life cycle.
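As a small, generic illustration (using MNIST as a stand-in dataset rather than the PPE example above), a typical pre-processing step once labels exist is to rescale the inputs and encode the labels into a model-friendly format:

import numpy as np
from tensorflow import keras

(x_train, y_train), _ = keras.datasets.mnist.load_data()

x_train = x_train.astype("float32") / 255.0         # normalise pixel values to [0, 1]
x_train = x_train.reshape(-1, 28, 28, 1)            # add a channel dimension
y_train = keras.utils.to_categorical(y_train, 10)   # one-hot encode the class labels

print(x_train.shape, y_train.shape)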

 DL Model Development

In order to gain insights into the data you have collected and pre-processed, it is crucial that
you set your target variable, which is the main quantity you want to gain deeper insights about
from your data. For example, suppose you want to detect how many vehicles are parked in the
parking garage at a construction site. We will have to train a DL model to detect the vehicles in
the parking area from the CCTV camera. Sounds simple, doesn't it? Now, think about detecting
how many of the vehicles are cars, trucks, SUVs, etc., and, going further, the brand, make and
color of each vehicle. To perform all these analyses we need to perform iterative learning of the
model and need a model ensemble layer to give us the desired results. The "deep" in DL refers
to how deep the models are.

 DL Model Optimization

After model training is completed, the model needs to be optimized further for the specific use case.
There are generally three types of model optimization (a sketch of one appears after this list):

1. Accuracy optimization
2. Memory optimization
3. Latency and throughput optimization
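As one concrete, hedged example of memory and latency optimization, post-training quantization with TensorFlow Lite might look like the following; the tiny Keras model here is only a stand-in for a real trained model:

import tensorflow as tf

# stand-in for an already trained Keras model (an assumption for this sketch)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="softmax", input_shape=(4,)),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable default quantization
tflite_model = converter.convert()                     # smaller, faster model

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print(len(tflite_model), "bytes")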

 DL Model Inference

After the model has been successfully trained & optimized, the task at hand is to now explain
these outcomes to people with very little knowledge in the field of data science. This can be
particularly difficult. In the earlier years, it was difficult to convey these insights to important
stakeholders in the business, but with time it has become easier. It is important to note that the
more interpretable your model is, the easier it becomes to communicate its value and importance
to key stakeholders.
 DL Model Governance

The last step in this process is to implement, document, and version DL models, and ultimately
maintain the data science project so that it can be harnessed while its models are simultaneously
improved. This step is also generally referred to as the model CI/CD pipeline, and it is the longest
in duration in the entire deep learning cycle.

The DL life cycle gives us a better perspective on how a data science project should be
structured in order to add more value to a business or industry as a whole. Failing to execute any
of these steps could result in unnecessary biases and misleading results.

3. DenseNet

DenseNet is a densely connected convolutional network. It is very similar to a ResNet, with
some fundamental differences. ResNet uses an additive method: it takes a previous layer's output
as an input to a future layer. DenseNet, by contrast, takes the outputs of all previous layers as
input to a future layer.

DenseNet was developed specifically to address the accuracy degradation caused by vanishing
gradients in very deep neural networks: because of the long distance between the input and
output layers, the information vanishes before reaching its destination.
Suppose we have L layers. In a typical network with L layers, there will be L connections, that is,
one connection between each pair of consecutive layers. In a DenseNet, however, there will be
about L(L+1)/2 direct connections. Because of this dense connectivity, gradients flow more
easily, and we can train models with more than 100 layers very easily using this technique.
DenseBlocks And Layers

As we go deeper into the network, this kind of connectivity becomes unsustainable: going from
the 2nd layer to the 3rd layer, the 3rd layer takes as input not only the 2nd layer but all previous
layers.
Say we have about ten layers. The 10th layer will then take as input all the feature maps from the
preceding nine layers. If each of these layers produces, say, 128 feature maps, there is a
feature-map explosion. To overcome this problem, we create dense blocks, where each dense
block contains a prespecified number of layers inside it.
The output from a particular dense block is given to what is called a transition layer, which is a
one-by-one convolution followed by pooling (average pooling in the original DenseNet) to
reduce the size of the feature maps. So the transition layer allows for pooling, which typically
leads to a reduction in the size of your feature maps.
The transition layer is thus a combination of two blocks: a convolution layer and a pooling layer.
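The sketch below shows what a dense block and a transition layer might look like in Keras; the growth rate, number of layers, compression factor, and input shape are illustrative values, not the configuration of the original paper:

from tensorflow import keras
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    # each layer receives the concatenation of all previous feature maps
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)
        x = layers.Concatenate()([x, y])      # dense connectivity
    return x

def transition_layer(x, compression=0.5):
    filters = int(x.shape[-1] * compression)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(filters, 1)(x)          # 1x1 convolution
    x = layers.AveragePooling2D(2)(x)         # reduce feature-map size
    return x

inputs = keras.Input(shape=(32, 32, 16))
x = dense_block(inputs)
x = transition_layer(x)
model = keras.Model(inputs, x)
model.summary()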
Some advantages of DenseNet are listed below.
Parameter efficiency – Every layer adds only a limited number of parameters; e.g., only
about 12 kernels are learned per layer.
Implicit deep supervision – Improved flow of gradient through the network; feature maps in
all layers have direct access to the loss function and its gradient.
Growth rate – This determines the number of feature maps output by the individual layers inside
dense blocks.
Dense connectivity – By dense connectivity, we mean that within a dense block each layer
receives as input the feature maps of all preceding layers.
Composite functions – The sequence of operations inside a layer goes as follows: batch
normalization, followed by an application of ReLU, and then a convolution layer (one
convolution layer).
Transition layers – The transition layers aggregate the feature maps from a dense block and
reduce their dimensions, so pooling is applied here.
4. Generalization error in DL
Machine learning researchers are consumers of optimization algorithms. Sometimes, we must
even develop new optimization algorithms. But at the end of the day, optimization is merely a
means to an end. At its core, machine learning is a statistical discipline and we wish to optimize
training loss only insofar as some statistical principle (known or unknown) leads the resulting
models to generalize beyond the training set.

On the bright side, it turns out that deep neural networks trained by stochastic gradient descent
generalize remarkably well across myriad prediction problems, spanning computer vision;
natural language processing; time series data; recommender systems; electronic health records;
protein folding; value function approximation in video games and board games; and countless
other domains. On the downside, if you were looking for a straightforward account of either the
optimization story (why we can fit them to training data) or the generalization story (why the
resulting models generalize to unseen examples), then you might want to pour yourself a drink.
While our procedures for optimizing linear models and the statistical properties of the solutions
are both described well by a comprehensive body of theory, our understanding of deep learning
still resembles the wild west on both fronts.
The theory and practice of deep learning are rapidly evolving on both fronts, with theorists
adopting new strategies to explain what’s going on, even as practitioners continue to innovate at
a blistering pace, building arsenals of heuristics for training deep networks and a body of
intuitions and folk knowledge that provide guidance for deciding which techniques to apply in
which situations.
The TL;DR of the present moment is that the theory of deep learning has produced promising
lines of attack and scattered fascinating results, but still appears far from a comprehensive
account of both (i) why we are able to optimize neural networks and (ii) how models learned by
gradient descent manage to generalize so well, even on high-dimensional tasks. However, in
practice, (i) is seldom a problem (we can always find parameters that will fit all of our training
data), and thus understanding generalization is by far the bigger problem. On the other hand, even
absent the comfort of a coherent scientific theory, practitioners have developed a large collection
of techniques that may help you to produce models that generalize well in practice. While no
pithy summary can possibly do justice to the vast topic of generalization in deep learning, and
while the overall state of research is far from resolved, we hope, in this section, to present a
broad overview of the state of research and practice.
Revisiting Overfitting and Regularization
According to the “no free lunch” theorem by Wolpert et al. (1995), any learning algorithm
generalizes better on data with certain distributions, and worse with other distributions. Thus,
given a finite training set, a model relies on certain assumptions: to achieve human-level
performance it may be useful to identify inductive biases that reflect how humans think about
the world. Such inductive biases show preferences for solutions with certain properties. For
example, a deep MLP has an inductive bias towards building up a complicated function by
composing simpler functions together.
With machine learning models encoding inductive biases, our approach to training them
typically consists of two phases: (i) fit the training data; and (ii) estimate the generalization error
(the true error on the underlying population) by evaluating the model on holdout data. The
difference between our fit on the training data and our fit on the test data is called the
generalization gap and when the generalization gap is large, we say that our models overfit to the
training data. In extreme cases of overfitting, we might exactly fit the training data, even when
the test error remains significant. And in the classical view, the interpretation is that our models
are too complex, requiring that we either shrink the number of features, the number of nonzero
parameters learned, or the size of the parameters as quantified by some norm. Recall the plot of
model complexity vs. loss (Fig. 3.6.1) from Section 3.6.

However deep learning complicates this picture in counterintuitive ways. First, for classification
problems, our models are typically expressive enough to perfectly fit every training example,
even in datasets consisting of millions (Zhang et al., 2021). In the classical picture, we might
think that this setting lies on the far right extreme of the model complexity axis, and that any
improvements in generalization error must come by way of regularization, either by reducing the
complexity of the model class, or by applying a penalty, severely constraining the set of values
that our parameters might take. But that is where things start to get weird.
Strangely, for many deep learning tasks (e.g., image recognition and text classification) we are
typically choosing among model architectures, all of which can achieve arbitrarily low training
loss (and zero training error). Because all models under consideration achieve zero training
error, the only avenue for further gains is to reduce overfitting. Even stranger, it is often the case
that despite fitting the training data perfectly, we can actually reduce the generalization error
further by making the model even more expressive, e.g., adding layers, nodes, or training for a
larger number of epochs. Stranger yet, the pattern relating the generalization gap to the
complexity of the model (as captured, e.g., in the depth or width of the networks) can be non-
monotonic, with greater complexity hurting at first but subsequently helping in a so-called
“double-descent” pattern (Nakkiran et al., 2021). Thus the deep learning practitioner possesses a
bag of tricks, some of which seemingly restrict the model in some fashion and others that
seemingly make it even more expressive, and all of which, in some sense, are applied to mitigate
overfitting.
Complicating things even further, while the guarantees provided by classical learning theory can
be conservative even for classical models, they appear powerless to explain why it is that deep
neural networks generalize in the first place. Because deep neural networks are capable of fitting
arbitrary labels even for large datasets, and despite the use of familiar methods like
regularization, traditional complexity-based generalization bounds, e.g., those based on the VC
dimension or the Rademacher complexity of the hypothesis class, cannot explain why neural
networks generalize.
Inspiration from Nonparametrics
Approaching deep learning for the first time, it is tempting to think of deep networks as parametric
models. After all, the models do have millions of parameters. When we update the models, we
update their parameters. When we save the models, we write their parameters to disk. However,
mathematics and computer science are riddled with counterintuitive changes of perspective, and
with surprising isomorphisms between seemingly different problems. While neural networks
clearly have parameters, in some ways it can be more fruitful to think of them as behaving like
nonparametric models. So what precisely makes a model nonparametric? While the name covers
a diverse set of approaches, one common theme is that nonparametric methods tend to have a
level of complexity that grows as the amount of available data grows.
Perhaps the simplest example of a nonparametric model is the k-nearest neighbor algorithm (we
will cover more nonparametric models later, such as in Section 11.2). Here, at training time, the
learner simply memorizes the dataset. Then, at prediction time, when confronted with a new
point, the learner looks up the k nearest neighbors (the k points that minimize some distance).
When k = 1, this algorithm is called 1-nearest neighbors, and the algorithm will always achieve a
training error of zero. That, however, does not mean that the algorithm will not generalize. In
fact, it turns out that under some mild conditions, the 1-nearest neighbor algorithm is consistent
(eventually converging to the optimal predictor).
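A quick way to see the zero-training-error property is with scikit-learn's k-nearest-neighbors classifier on any small dataset (Iris is used here purely as an example):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)          # "training" here is just memorizing the dataset
print(knn.score(X, y)) # 1.0: each training point's nearest neighbor is itself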

In a sense, because neural networks are over-parameterized, possessing many more parameters
than are needed to fit the training data, they tend to interpolate the training data (fitting it
perfectly) and thus behave, in some ways, more like nonparametric models. More recent
theoretical research has established a deep connection between large neural networks and
nonparametric methods, notably kernel methods. In particular, Jacot et al. (2018) demonstrated
that in the limit, as multilayer perceptrons with randomly initialized weights grow infinitely
wide, they become equivalent to (nonparametric) kernel methods for a specific choice of the
kernel function (essentially, a distance function), which they call the neural tangent kernel.
While current neural tangent kernel models may not fully explain the behavior of modern deep
networks, their success as an analytical tool underscores the usefulness of nonparametric
modeling for understanding the behavior of over-parameterized deep networks.
Early Stopping
While deep neural networks are capable of fitting arbitrary labels, even when labels are assigned
incorrectly or randomly (Zhang et al., 2021), this ability only emerges over many iterations of
training. A new line of work (Rolnick et al., 2017) has revealed that in the setting of label noise,
neural networks tend to fit cleanly labeled data first and only subsequently to interpolate the
mislabeled data. Moreover, it has been established that this phenomenon translates directly into a
guarantee on generalization: whenever a model has fitted the cleanly labeled data but not the
randomly labeled examples included in the training set, it has in fact generalized (Garg et al.,
2021).
Together these findings help to motivate early stopping, a classic technique for regularizing deep
neural networks. Here, rather than directly constraining the values of the weights, one constrains
the number of epochs of training. The most common way to determine the stopping criterion is to
monitor validation error throughout training (typically by checking once after each epoch) and to
cut off training when the validation error has not decreased by more than some small amount for
some number of epochs. This is sometimes called a patience criterion. Besides the potential to
lead to better generalization in the setting of noisy labels, another benefit of early stopping is the
time saved. Once the patience criterion is met, one can terminate training. For large models that
might require days of training simultaneously across 8 GPUs or more, well-tuned early stopping
can save researchers days of time and can save their employers many thousands of dollars.
Notably, when there is no label noise and datasets are realizable (the classes are truly separable,
e.g., distinguishing cats from dogs), early stopping tends not to lead to significant improvements
in generalization. On the other hand, when there is label noise, or intrinsic variability in the label
(e.g., predicting mortality among patients), early stopping is crucial. Training models until they
interpolate noisy data is typically a bad idea.
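A minimal sketch of early stopping with a patience criterion, using Keras's built-in callback on made-up noisy regression data (the patience and min_delta values are illustrative, not recommendations):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# toy regression data with noisy labels
X = np.random.randn(1000, 20).astype("float32")
y = X[:, 0] + 0.5 * np.random.randn(1000).astype("float32")

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch validation error after each epoch
    patience=5,                 # allow 5 epochs without improvement
    min_delta=1e-3,             # "more than some small amount"
    restore_best_weights=True,  # roll back to the best checkpoint
)

model.fit(X, y, validation_split=0.2, epochs=200, callbacks=[stop], verbose=0)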
Classical Regularization Methods for Deep Networks
In Section 3, we described several classical regularization techniques for constraining the
complexity of our models. In particular, Section 3.7 introduced a method called weight decay,
which consists of adding a regularization term to the loss function to penalize large values of the
weights. Depending on which weight norm is penalized, this technique is known either as ridge
regularization (for an ℓ2 penalty) or lasso regularization (for an ℓ1 penalty). In the classical analysis of
these regularizers, they are considered to restrict the values that the weights can take sufficiently
to prevent the model from fitting arbitrary labels.
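In a deep learning framework this typically amounts to attaching a weight penalty to each layer; the sketch below shows an L2 (ridge) penalty in Keras with an arbitrary strength of 1e-4, chosen only for illustration:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4),
                 input_shape=(784,)),
    layers.Dense(10, activation="softmax",
                 kernel_regularizer=regularizers.l2(1e-4)),
])
# the L2 penalty terms are added to the loss defined here during training
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")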

In deep learning implementations, weight decay remains a popular tool. However, researchers
have noted that typical strengths of regularization are insufficient to prevent the networks from
interpolating the data, and thus the benefits, if interpreted as regularization, might only make sense
in combination with an early stopping criterion. Absent early stopping, it is possible that, just like
the number of layers or number of nodes (in deep learning) or the distance metric (in 1-nearest
neighbor), these methods may lead to better generalization not because they meaningfully
constrain the power of the neural network but rather because they somehow encode inductive
biases that are better compatible with the patterns found in datasets of interest. Thus, classical
regularizers remain popular in deep learning implementations, even if the theoretical rationale for
their efficacy may be radically different.
Notably, deep learning researchers have also built on techniques first popularized in classical
regularization contexts, such as adding noise to model inputs. In the next section we will
introduce the famous dropout technique (invented by Srivastava et al. (2014)), which has
become a mainstay of deep learning, even as the theoretical basis for its efficacy remains
similarly mysterious.

5. Mastering the Game of Go without Human Knowledge


The problem of developing a pure reinforcement learning approach (i.e. without the use of any
domain knowledge beyond basic game rules) for the game of Go has been addressed.

Why is that a problem worth solving?

AlphaGo Fan and AlphaGo Lee successfully outperformed the best players of the game of Go in
recent years. A limitation of these approaches is that they require extensive knowledge of human
expert moves. It can be difficult, or even infeasible, to obtain such data sets for many real-world
problems. An approach that learns purely through self-play achieving stable and faster learning
would be a huge step in overcoming these limitations. Such an approach would pave the way for
solving many real-world games whose datasets are not readily available.  

Approach:

The proposed approach called AlphaGo Zero (a tabula rasa algorithm) has a single neural
network that acts as both a policy network (outputs action probabilities) as well as a value
network (outputs action values) as opposed to two separate networks in AlphaGo. It merely uses
the positions of white and black stones on the board as input features and basic game rules as
domain knowledge without any requirement of human expert moves. 

Given neural network parameters and the current state of the game, Monte-Carlo Tree Search
(MCTS) computes a vector of search probabilities and recommends moves to play. In other
words, the trained neural network is used to guide MCTS to select the most promising moves.
These moves are then used by the neural network to self-play and find the game-winner. The
network parameters are updated to maximize the similarity of the neural network’s action
probabilities to the search probabilities and to minimize the error between the predicted value
and the self-play winner. Thus, the approach represents repeated self-play or a policy iteration
procedure with MCTS (without rollouts) improving the policy and the neural network self-play
evaluating the policy.
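The training targets described above can be summarized by the paper's loss, which combines a squared error on the predicted value and a cross-entropy between the network's move probabilities and the MCTS search probabilities (plus a weight penalty, omitted here). A toy NumPy illustration with made-up numbers:

import numpy as np

pi = np.array([0.70, 0.20, 0.10])       # MCTS search probabilities (targets)
p = np.array([0.50, 0.30, 0.20])        # network's predicted move probabilities
v = 0.4                                 # network's predicted value of the position
z = 1.0                                 # self-play outcome (+1 win, -1 loss)

value_loss = (z - v) ** 2               # squared error on the game outcome
policy_loss = -np.sum(pi * np.log(p))   # cross-entropy to the search probabilities
loss = value_loss + policy_loss

print(value_loss, policy_loss, loss)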

Finally, the authors explain the architecture in detail and, through extensive empirical evaluation
comparing pure-RL AlphaGo Zero with previous supervised-learning-based approaches, show
that pure RL can achieve higher move prediction accuracy, learn faster and more stably, and
discover novel game strategies.

Limitations:

 The paper uses simulations that exactly emulate a real game of Go. It will be interesting
to see how the approach performs for games with partial or noisy observations.
 The game of Go is also fully deterministic. Hence, Go isn't the most challenging game on
which to claim that pure reinforcement learning is fully feasible for real-world domains.
 Although labels are effectively being learnt and used to improve the network (which is
closer to supervised learning), that perspective has not been mentioned. Policy iteration
alone is not RL, but since the reward function and transition function are not provided to
the learner, it is RL. So it is best described as RL + search (MCTS).

Strengths:

 I like how the neural network and MCTS are combined in an elegant policy iteration
framework to achieve stable learning. 
 The proposed approach eliminates the need for rollouts by using the neural network to
evaluate positions. It also uses a single network and doesn’t require any human expert
knowledge except for basic game rules. 
 The most notable result that stood out to me is that AlphaGo Zero not only learned
existing human knowledge on its own but also discovered novel creative strategies. 
